fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
929 stars 196 forks source link

Don't iterate sequence twice in Frame.ofRowsOrdinal #357

Closed kflu closed 6 years ago

kflu commented 7 years ago

This addresses issue #356

Frame.ofRowsOrdinal iterates the sequence TWICE. This is a big issue when the implementation of the sequence involves some resources at may only be able to be iterated once, for example, an IDataReader instance from a database connection.

While it can be worked around by caching the sequence into an array before feeding it to Frame.ofRowsOrdinal. But in general, the common practice of working with seq and IEnumerable is that the function that takes the sequence (the callee) should take care of caching if it is to be iterates multiple times.

tpetricek commented 7 years ago

Thanks for looking into the issue!

It seems that one of the iterations over the sequence here is really just to get its length so that we can generate keys for the series. While calling Array.ofSeq definitely does the trick, it sounds like a fairly expensive way of doing this.

I think we could add a way of creating an ordinal series that does not require keys up-front and just generates the ordinal numbers as it iterates over the provided values.

Perhaps I'm worrying too much though. But this sounds like it can be done with one iteration without caching if we change it a bit more. What do you think, @adamklein ?

kflu commented 7 years ago

@tpetricek you're right. I've changed the pull request so now it only requires one pass.

tpetricek commented 7 years ago

Hmm, I'm afraid this still does not avoid the caching... (That said, I'm really not sure how much this matters in practice....! So perhaps we should just merge your original version :-)).

The current change just delegates the work to the series function, which calls Series.ofObservations, which then creates the temp array: https://github.com/BlueMountainCapital/Deedle/blob/5d347cf9329d427e3872c1197303f20554e37a32/src/Deedle/SeriesExtensions.fs#L33

If you look at some of the series constructors, you can see that we need to create "index" and "vector" values for the series.

So, if we wanted to do this with one iteration, we'd first create a "vector" from the rows and then create an ordinal index with integers from 0 the the length of the vector (which we'll know once the vector is created).

Could you try something along the lines of:

let vector = VectorBuilder.Instance.Create rows
let index = IndexBuilder.Instance.Create(seq { 0 .. vector.Length-1 })
FrameUtils.fromRows IndexBuilder.Instance VectorBuilder.Instance  Series(index, vector)  

?

And thanks very much for contributing! Hopefully my comments here are not too annoying :-)

kflu commented 7 years ago

Sure. Your suggestion looks very reasonable. I will find sometime to send an update to the PR.

kflu commented 7 years ago

@tpetricek OK, I tried your suggestion. VectorBuilder.Instance.Create actually takes a 'a []. So we still need to cache the array anyways. If that's the case, we might just revert back to my original version. Let me know what do you think.

tpetricek commented 7 years ago

I think VectorBuilder.Instance.Create has one small advantage - in the optimal case, it will just store the array we give it (and just wrap it with the vector object) - in the original solution, we created one array, but then the vector builder would create another array to keep the values internally.

So, doing this saves us one array allocation. (As I said, not sure if it's a big deal or not, but now that you have the potentially more efficient, we can probably go with that!)

kflu commented 7 years ago

The created index is not a seq? How should I fix?

image

Severity    Code    Description Project File    Line    Suppression State
Error       Possible overload: 'new : keys:seq<'K> * values:seq<'V> -> Series<'K,'V>'. Type constraint mismatch. The type 
    Indices.IIndex<int64>    
is not compatible with type
    seq<'a>    
The type 'Indices.IIndex<int64>' is not compatible with the type 'seq<'a>'. Deedle  C:\Users\kelu\work\Deedle\src\Deedle\FrameExtensions.fs 516 
dsyme commented 6 years ago

Fixed by #396