Closed adamklein closed 10 years ago
Just a random aside:
let tups = ([|0 .. 1000000|] |> Array.zip [|0 .. 1000000|])
let s1 = series tups
let arr = [|1 .. 1000001|]
let s2 = s1 |> Series.realign arr
The final operation takes approx 8s on my machine, whereas the equivalent pandas code:
s1 = Series([i for i in xrange(1000000)])
ix = [i for i in xrange(1, 1000001)]
s1.reindex(ix)
The final statement takes 200ms, or, if you let ix = Index([...]), it takes 60ms.
Essentially, assuming pandas as a speed limit, we have 100x relative speed improvement possibility. Just food for thought.
Sad we are so far from Pandas in this case. We should clearly be able to bridge the gap...
One key to speed in the example above that pandas (and saddle) use is when realigning one ordered index to another, you don't need dictionary lookup - you can instead just walk the indices and values. Deedle invokes a lot of extra generalized machinery - eg, there's no special casing on ordered indices, and it uses joining logic to create a representation of the results to pass to the builder, when it would be much faster just to build the results in situ. I'd be curious to get thoughts from @tpetricek. I think realigning ordered indices is a common enough use case that it should be optimized.
The Address
abstraction is a bit broken - but I'm not entirely sure about how to best fix it. The original motivation was that we should be able to use multiple different types for addressing from index to vector.
The current implementation really only uses int
(supporting other types would need some work - but at least the Address
type clearly tells us where). The other type I had in mind was int64
for big data structures - I think supporting that would be quite useful. I was vaguely thinking that it might be useful to allow other types of indices (if we were thinking about continuous series or geo-spatial data) but that is a bit odd anyway and we might not want to support that anyway (and focus on doing other things well).
So, I think restricting the address to just int
would be unfortunate (I think some users like @Rickasaurus might need more addresses than that). But perhaps we could just get rid of the type and always use int64
? That would definitely simplify the code and I don't think we'd lose anything... (assuming int64
is as fast as int
).
(Maybe there is some nice abstraction that would work better then the current one - but since we do not have other use cases for this at the moment, I think dropping this abstraction layer might be the best option here... If we found other use in the future, we can re-add it once we actually know what we need...)
I'd be in favor of dropping the premature abstraction and adding something back when we actually need it.
Agree with @hmansell. I cannot think of a reason IVector cells should not be addressed via integer offsets. I think that any other addressing scheme indicates you should be using a Series, which naturally combines non-integer addressing and vector-like operations. int64 should be fine for speed, the only drawback is memory considerations which hasn't been a design goal.
This leads to another question about the semantics: Deedle LinearIndex is designed differently than the pandas/saddle index in that the offset of the key doesn't necessarily reflect the address, ie the index ["a", "b"] might refer to vector locations [10, 20]. This is nice in that it gives a natural way to have series with vectors having length different than the index. However, I think it would be much more efficient operationally to have the IVector abstract the offset -> data mapping, rather than the Index, making the index simply an ordered sequence of keys instead of (key,address) tuples. In other words, a IVector.[0] may possibly map to a value sitting at array cell offset 10 in memory/disk, but an IIndex.[0] key always refers to IVector.[0], and IIndex length must == Vector length. I believe this foundation will allow much faster index processing.
Open to counterarguments -
Regarding address - yes, I think using int64
is a way to go! It might still make sense to keep this in the addressing module, but we can make the Address
type just an alias for int64
(Keeping the functionality together in a module should let us easily see the overhead for int
vs. int64
and it might be a bit more readable.) But I agree this is a great thing to do!
Regarding linear index - my first thought was that I agree that the index should be mapping from [0 ... elements.Count - 1]. In fact, I think this is how it actually works at the moment (but it is not treated as an invariant that is assumed to hold - at least I think). The only case why we might want to do something different is when you perform some slicing or grouping and want to avoid copying the data - but I think this is better done by implementing a separate kind of index.
So, my suggestion would be to make:
LinearIndex
- always maps keys to indices [0 .. length-1] - I think this is currently the case, but we can make this an explicit invariant and rely on this (to make the implementation better). Currently, the type keeps both a dictionary (for fast lookup) and a (lazily constructed) array with all the keys for scan-like search in an ordered index (i.e. it performs binary search when looking for a nearest smaller/greater key than the specified key).As needed, we could add something like:
RangeIndex
(akin to ArraySegment) to capture sub-index of a larger index - this does not necessarily have to implement all index operations, because it can always lazily copy the underlying vector & index (to construct real LinearIndex
) but I suppose many of the common operations would be easy to implement...Hope this helps & makes sense!
BTW: Sorry for my sluggishness in reviewing and contributing any code! I was hoping to be able to do more over the break (as usual :-)), but I'll try to - at least - review the pending pull requests. Also, Happy New year to everyone!
Following up a somewhat strange example in which @dshiber inadvertantly built a 18.6 million observation timeseries, I saw that there is a fat hot spot in two places in LinearIndex.fs, the first where addresses is constructed on line 51:
Here, if keys is an array, a scan over the whole array is wasted via Seq.length; instead you could do:
that dispatches to the array count function... The other is place is
If addresses were just implicitly an enumeration from 0 to number of keys, you could do this instead:
Which is much faster than zipping and unpacking the tuples. And in general, picking up the discussion here:
http://stackoverflow.com/questions/18815210/does-using-single-case-discriminated-union-types-have-implications-on-performanc
@tpetricek, if I ignore the array allocation, I get timing as follows - 300% performance penalty, rather than the 50% you note.
I'm having trouble seeing the abstraction benefit to Address given that seq and array are int32 index based, the main Deedle data types are hardwired to use linear index builder and array vector builder, and lookup functions such as GetKeyAt generally take int32 anyway... I mean, I get the idea, but the address abstraction just seems too leaky to me vs the performance benefit of getting rid of it.
Thoughts @hmansell @tpetricek ?