fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
933 stars 195 forks source link

Series.shift aliases and some other issues #306

Closed buybackoff closed 9 years ago

buybackoff commented 9 years ago

With the spirit of "Don't make me think", I think it will be useful to have Series.shiftPrevious and Series.shiftNext that accept only positive numbers. I had to read the description more than twice and still was not comfortable without a test if I am looking into the future or the past... Another option is to use a name lag that only accepts positive values for history, to protect people from looking into future during backtests. It is too easy to make a mistake now with +1/-1 and then hard to find it. One could argue that something[-1] is more natural so the shift should use -1 for history. What do you think?

tpetricek commented 9 years ago

I think that adding nicer aliases for shift -1 and shift +1 would be useful.

Although, I think it is hard to come up with good names for those! I think finance people will probably know lag, so perhaps that would be good addition? As for shiftPrevious and shiftNext, I think those are very similar to shift -1 and shift +1 (and I'm not sure if they help much?). At least, I don't really have a good intuition whether these are shifting forward/backward the values or the keys of the series. So something like shiftValuesForward might be more descriptive, but then - it's ugly as hell!

Perhaps @hmansell or @adamklein have some opinions on this?

buybackoff commented 9 years ago

May be very explicit comment in the intellisense doc could be enough, e.g. "for historical values, use positive numbers". The word lag implies history, and using uint for parameter will remove ambiguity completely.

tpetricek commented 9 years ago

I think those two would be great

tpetricek commented 9 years ago

So, if you can send PR adding those two, that would be amazing!

buybackoff commented 9 years ago

Will do next week

buybackoff commented 9 years ago

A week was too optimistic after I defined lag as a simple function and solved the issue locally, sorry...

Some more issues: I miss overlapped windows (like here), stats for product, JSON.NET serialization and want to try Yeppp for boosting speed.

Overlapped windows are a general case for windows/chunks with a size and a step. When step = 1 - it is a window, when step=size - it is a chunk. Now I do this with size as an integer multiple of steps, then calculate chunks of step size, and then a process windows of the multiple of chunks. It works, but ugly.

Product operations are surprisingly missing from moving/expanding in Stats.fs. I do it now as let index = log (return + 1.0) |> Stats.expandingSum |> Series.Exp or (fun s -> Math.Exp <| (s.Data + 1.0).Log().Sum() ) and could live without built-in product for a while, but they are often needed in a fast simple form.

JSON.NET (v.6), despite being so versatile, cannot serialize series out of the box. I haven't looked deeper why and use an intermediary DTO with key/values array to feed to JSON.NET, but serialization other than .csv is useful.

There is a nice package https://www.nuget.org/packages/Yeppp.CLR.Bundle/ that works on all platforms and gives 5-10x performance boost for double[] operations with 100+ elements (in isolated tests). Now I call many operations such as log, sum, add, product on series much longer than 100 in a tight loop, and using a SIMD-optimized library seems like a free lunch. Especially when I what to minimize design time and get quick and dirty, but working within reasonable time, code.

Finally, documentation for chunkSizeInto says:

The key is the last key of the chunk, unless boundary behavior is Boundary.AtEnding (in which case it is the first key).

But the line always takes the first one and should be instead:

  [<CompiledName("ChunkSizeInto")>]
  let inline chunkSizeInto bounds f (series:Series<'K, 'T>) : Series<'K, 'R> =
    series.Aggregate(
      ChunkSize(bounds), 
      (fun d -> d.Data.Keys |> if (snd bounds) = Boundary.AtEnding then Seq.head else Seq.last), 
      fun ds -> OptionalValue(f ds))

I think that the documented behavior is the correct one.

I won't tell any date now, but I want to do this in random priority at some point when I finish a project at work. I will keep posting issues here if I find more. What do you think about these issues?

tpetricek commented 9 years ago

Thanks for the fantastic feedback and suggestions :-).

Overlapped windows

This sounds very useful - and the fact that it generalizes windowing & chunking is nice too. It would be nice to have this! (Though there are some more tricks around chunking - you can tell it to create chunk from the end and have incomplete chunks at the beginning - but I guess that can be done with overlapping windows too...)

product operations

This looks like a nice addition too!

JSON.NET 6

Supporting JSON.NET serialization would be nice. I wouldn't want to have a dependency on it, but if we can figure out what is it that's breaking it (and fix that), that would be great!

Yeppp

This looks interesting. Probably something that we should investigate! There are some performance tools in the tests folder that can be used to compare the speed of various operations before & after..

So, any PRs for any of these would be more then welcome :-)

buybackoff commented 9 years ago

What about chunkSizeInto? I have already changed the behavior locally according to documentation and hope that master/nuget will be the same rather than documentation changed.

buybackoff commented 9 years ago

However, all other window... and chunk... functions are consistent in returning the first key. So changing the comment in chunkSizeInto is probably better.

buybackoff commented 9 years ago

Series silently accepts empty arrays in constructor

var dates = new DateTime[100];
var values = new double[100];
var series = new Deedle.Series<DateTime, double>(dates, values); // expect to throw here
var enumerable = series.Keys;
var newArr = enumerable.ToArray(); // but throws here