Closed kflu closed 6 years ago
Thanks for looking into the issue!
It seems that one of the iterations over the sequence here is really just to get its length so that we can generate keys for the series. While calling Array.ofSeq
definitely does the trick, it sounds like a fairly expensive way of doing this.
I think we could add a way of creating an ordinal series that does not require keys up-front and just generates the ordinal numbers as it iterates over the provided values.
Perhaps I'm worrying too much though. But this sounds like it can be done with one iteration without caching if we change it a bit more. What do you think, @adamklein ?
@tpetricek you're right. I've changed the pull request so now it only requires one pass.
Hmm, I'm afraid this still does not avoid the caching... (That said, I'm really not sure how much this matters in practice....! So perhaps we should just merge your original version :-)).
The current change just delegates the work to the series
function, which calls Series.ofObservations
, which then creates the temp array: https://github.com/BlueMountainCapital/Deedle/blob/5d347cf9329d427e3872c1197303f20554e37a32/src/Deedle/SeriesExtensions.fs#L33
If you look at some of the series constructors, you can see that we need to create "index" and "vector" values for the series.
So, if we wanted to do this with one iteration, we'd first create a "vector" from the rows and then create an ordinal index with integers from 0 the the length of the vector (which we'll know once the vector is created).
Could you try something along the lines of:
let vector = VectorBuilder.Instance.Create rows
let index = IndexBuilder.Instance.Create(seq { 0 .. vector.Length-1 })
FrameUtils.fromRows IndexBuilder.Instance VectorBuilder.Instance Series(index, vector)
?
And thanks very much for contributing! Hopefully my comments here are not too annoying :-)
Sure. Your suggestion looks very reasonable. I will find sometime to send an update to the PR.
@tpetricek OK, I tried your suggestion. VectorBuilder.Instance.Create
actually takes a 'a []
. So we still need to cache the array anyways. If that's the case, we might just revert back to my original version. Let me know what do you think.
I think VectorBuilder.Instance.Create
has one small advantage - in the optimal case, it will just store the array we give it (and just wrap it with the vector object) - in the original solution, we created one array, but then the vector builder would create another array to keep the values internally.
So, doing this saves us one array allocation. (As I said, not sure if it's a big deal or not, but now that you have the potentially more efficient, we can probably go with that!)
The created index is not a seq? How should I fix?
Severity Code Description Project File Line Suppression State
Error Possible overload: 'new : keys:seq<'K> * values:seq<'V> -> Series<'K,'V>'. Type constraint mismatch. The type
Indices.IIndex<int64>
is not compatible with type
seq<'a>
The type 'Indices.IIndex<int64>' is not compatible with the type 'seq<'a>'. Deedle C:\Users\kelu\work\Deedle\src\Deedle\FrameExtensions.fs 516
Fixed by #396
This addresses issue #356
Frame.ofRowsOrdinal
iterates the sequence TWICE. This is a big issue when the implementation of the sequence involves some resources at may only be able to be iterated once, for example, an IDataReader instance from a database connection.While it can be worked around by caching the sequence into an array before feeding it to Frame.ofRowsOrdinal. But in general, the common practice of working with seq and IEnumerable is that the function that takes the sequence (the callee) should take care of caching if it is to be iterates multiple times.