fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

'Address' discriminated union & library performance #113

Closed adamklein closed 10 years ago

adamklein commented 10 years ago

Following up on a somewhat strange example in which @dshiber inadvertently built an 18.6-million-observation time series, I found hot spots in two places in LinearIndex.fs. The first is where addresses is constructed on line 51:

let addresses = Address.generateRange(Address.rangeOf(keys))

Here, if keys is an array, a scan over the whole array is wasted on Seq.length; instead you could do:

let addresses = Address.generateRange((Int 0, keys.Count() |> Int))
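
The general trick, as a minimal sketch (my own helper, not anything in Deedle), is to check whether the sequence is really a materialized collection before falling back to a full enumeration:

// Hedged sketch: get the length without a full scan when the keys are already materialized.
let fastLength (xs: seq<'T>) =
    match xs with
    // Arrays (and other ICollection<'T> implementations) know their length in O(1)
    | :? System.Collections.Generic.ICollection<'T> as c -> c.Count
    // Otherwise there is no choice but to enumerate
    | _ -> Seq.length xs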

The Count() call dispatches to the array's count rather than enumerating. The other place is:

do for k, v in mappings do ...

If addresses were just implicitly an enumeration from 0 to the number of keys, you could do this instead:

let mutable c = 0
do for k in keys do 
  match lookup.TryGetValue(k) with
  ...
  | _ -> 
      lookup.[k] <- (Int c)
      c <- c + 1
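
As a self-contained version of that idea (plain int addresses and a bare Dictionary; the names are illustrative, not the actual LinearIndex internals):

open System.Collections.Generic

// Hedged sketch: build the key -> address lookup with a running counter,
// without ever materializing a (key, address) tuple sequence.
let buildLookup (keys: 'K[]) =
    let lookup = Dictionary<'K, int>()
    let mutable c = 0
    for k in keys do
        // the first occurrence of a key gets the next address; duplicates are skipped
        if not (lookup.ContainsKey k) then
            lookup.[k] <- c
            c <- c + 1
    lookup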

This approach is much faster than zipping and unpacking the tuples. More generally, picking up the discussion here:

http://stackoverflow.com/questions/18815210/does-using-single-case-discriminated-union-types-have-implications-on-performanc

@tpetricek, if I ignore the array allocation, I get the timings below: a 300% performance penalty, rather than the 50% you note.

#time
type Foo = Foo of int
let foos = Array.init 1000000 Foo

// ~4 seconds on my machine
for i in 0 .. 1000 do
  let mutable total = 0
  for Foo f in foos do total <- total + f

let bars = Array.init 1000000 id

// ~1 second on my machine
for i in 0 .. 1000 do
  let mutable total = 0
  for b in bars do total <- total + b

I'm having trouble seeing the abstraction benefit of Address, given that seq and array are int32-index based, the main Deedle data types are hardwired to use the linear index builder and the array vector builder, and lookup functions such as GetKeyAt generally take int32 anyway. I get the idea, but the Address abstraction just seems too leaky to me relative to the performance benefit of getting rid of it.

Thoughts @hmansell @tpetricek ?

adamklein commented 10 years ago

Just a random aside:

let tups = ([|0 .. 1000000|] |> Array.zip [|0 .. 1000000|])
let s1 = series tups
let arr = [|1 .. 1000001|]
let s2 = s1 |> Series.realign arr

The final operation takes approx 8s on my machine, whereas the equivalent pandas code:

from pandas import Series, Index

s1 = Series([i for i in xrange(1000000)])
ix = [i for i in xrange(1, 1000001)]
s1.reindex(ix)

The final statement takes about 200ms, or, if you construct ix as Index([...]) up front, about 60ms.

Essentially, taking pandas as the speed limit, there is roughly a 100x relative speed improvement available. Just food for thought.

hmansell commented 10 years ago

Sad we are so far from Pandas in this case. We should clearly be able to bridge the gap...

adamklein commented 10 years ago

One key to the speed that pandas (and Saddle) get in the example above is that when realigning one ordered index to another, you don't need dictionary lookups - you can instead just walk the two indices and the values in tandem. Deedle invokes a lot of extra generalized machinery - e.g., there's no special casing on ordered indices, and it uses joining logic to create a representation of the results to pass to the builder, when it would be much faster just to build the results in situ. I'd be curious to get thoughts from @tpetricek. I think realigning ordered indices is a common enough use case that it should be optimized.
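
To make the ordered walk concrete, here is a rough sketch (invented function name, option-based missing values, and plain arrays - not Deedle's actual index/vector machinery):

// Hedged sketch: realign values from (oldKeys, oldValues) onto newKeys,
// assuming both key arrays are sorted ascending. One pass, O(n + m),
// no dictionary lookups; keys absent from the old index become None.
let realignOrdered (oldKeys: 'K[]) (oldValues: 'V[]) (newKeys: 'K[]) : 'V option[] =
    let result = Array.zeroCreate newKeys.Length
    let mutable i = 0                                   // cursor into the old index
    for j in 0 .. newKeys.Length - 1 do
        // advance the old cursor until it is at or past the new key
        while i < oldKeys.Length && oldKeys.[i] < newKeys.[j] do
            i <- i + 1
        if i < oldKeys.Length && oldKeys.[i] = newKeys.[j] then
            result.[j] <- Some oldValues.[i]            // exact match: copy the value
        else
            result.[j] <- None                          // no match: missing
    result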

tpetricek commented 10 years ago

The Address abstraction is a bit broken - but I'm not entirely sure how best to fix it. The original motivation was that we should be able to use multiple different types for addressing from index to vector.

The current implementation really only uses int (supporting other types would need some work - but at least the Address type clearly tells us where). The other type I had in mind was int64 for big data structures - I think supporting that would be quite useful. I was vaguely thinking that it might be useful to allow other types of indices (if we were thinking about continuous series or geo-spatial data), but that is a bit odd and we might not want to support it anyway (and instead focus on doing other things well).

So, I think restricting the address to just int would be unfortunate (I think some users, like @Rickasaurus, might need more addresses than that). But perhaps we could just get rid of the type and always use int64? That would definitely simplify the code, and I don't think we'd lose anything (assuming int64 is as fast as int).

(Maybe there is some nicer abstraction that would work better than the current one - but since we do not have other use cases for this at the moment, I think dropping this abstraction layer might be the best option here. If we find other uses in the future, we can re-add it once we actually know what we need.)

hmansell commented 10 years ago

I'd be in favor of dropping the premature abstraction and adding something back when we actually need it.

adamklein commented 10 years ago

Agree with @hmansell. I cannot think of a reason why IVector cells should not be addressed via integer offsets; any other addressing scheme suggests you should be using a Series, which naturally combines non-integer addressing with vector-like operations. int64 should be fine for speed; the only drawback is memory consumption, which hasn't been a design goal.

This leads to another question about the semantics. Deedle's LinearIndex is designed differently from the pandas/Saddle index in that the offset of a key doesn't necessarily reflect its address, i.e., the index ["a", "b"] might refer to vector locations [10, 20]. This is nice in that it gives a natural way to have a series whose vector has a different length than its index. However, I think it would be much more efficient operationally to have the IVector abstract the offset -> data mapping, rather than the Index, making the index simply an ordered sequence of keys instead of (key, address) tuples. In other words, IVector.[0] may map to a value sitting at array cell offset 10 in memory or on disk, but the IIndex.[0] key always refers to IVector.[0], and the IIndex length must equal the IVector length. I believe this foundation will allow much faster index processing.
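
Roughly, the shape I have in mind looks something like this (hypothetical names only, not the current IIndex/IVector API):

// Sketch of the proposed invariant: position i in the index always refers
// to position i in the vector, and the two lengths are equal.
type IProposedIndex<'K> =
    abstract Keys : 'K[]              // just an ordered sequence of keys
    abstract Length : int64           // must equal the vector's length

type IProposedVector<'V> =
    abstract Length : int64
    // The vector owns the offset -> data mapping; logical position i may live
    // at some other physical offset in memory or on disk.
    abstract GetValueAt : int64 -> 'V option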

Open to counterarguments -

tpetricek commented 10 years ago

Regarding the address - yes, I think using int64 is the way to go! It might still make sense to keep this in the addressing module, but we can make the Address type just an alias for int64. (Keeping the functionality together in a module should let us easily see the overhead of int vs. int64, and it might be a bit more readable.) But I agree this is a great thing to do!
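
Something along these lines, as a rough sketch (the helper names are made up):

// Sketch: keep the addressing logic together in one module, but make Address
// a plain alias for int64 - no wrapper allocation, no pattern matching.
module Address =
    type Address = int64
    // Keeping the helpers in one place makes the int vs. int64 overhead easy to see and change.
    let inline ofInt (i: int) : Address = int64 i
    let inline asInt (a: Address) : int = int a
    let inline increment (a: Address) : Address = a + 1L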

Regarding the linear index - my first thought is that I agree the index should be a mapping from [0 .. elements.Count - 1]. In fact, I think this is how it actually works at the moment (though it is not treated as an invariant that is assumed to hold - at least I think). The only case where we might want to do something different is when you perform some slicing or grouping and want to avoid copying the data - but I think this is better done by implementing a separate kind of index.
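
For illustration, such a separate index might look roughly like this (invented names, not a concrete proposal): a wrapper that keeps the [0 .. Count - 1] invariant for its own positions while mapping them back to a sub-range of the underlying data, so slicing needs no copy.

// Hypothetical sketch only: a sliced view over a base index/vector.
type SlicedIndex<'K>(baseKeys: 'K[], start: int, count: int) =
    member x.Count = count
    member x.KeyAt(i: int) = baseKeys.[start + i]          // key at position i of the slice
    member x.BaseAddressOf(i: int) = int64 (start + i)     // address into the base vector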

So, my suggestion would be to make:

As needed, we could add something like:

Hope this helps & makes sense!

BTW: Sorry for my sluggishness in reviewing and contributing code! I was hoping to be able to do more over the break (as usual :-)), but I'll try to at least review the pending pull requests. Also, Happy New Year to everyone!