fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
929 stars, 196 forks

Performance bottlenecks #329

Closed (buybackoff closed 7 years ago)

buybackoff commented 8 years ago

With the release of Spreads, I tried to avoid posting in the README the benchmarks that I have been seeing over the last months (but couldn't resist mentioning them briefly). I thought that either I was doing something wrong, or the NuGet packages contain Debug builds, or something else... This test gives a roughly 150x performance difference on two machines. Yes, it is trivial, but that is the point - if arithmetic operations slow me down so much, I am losing any hope that powerful boxes could save me from the curse of dimensionality during discrete optimization. There are other benchmarks where Deedle's Zip of two series is more than 10x slower, even though Deedle has an equal-keys optimization that is not yet implemented in Spreads. That just cannot be right, and there must be some very low-hanging fruit for Deedle.

Also, the eagerness of DataSegments in window operations is (or was in the summer) a real issue. Deedle already has some virtual classes that use slices of series, or something similar to .NET's ArraySegment. Given that Series are immutable in Deedle, it should be easy and safe to return lazy/virtual objects instead of materialized series segments from windows. We had an issue where a 32 GB box blew up with an OOM and we could not understand why; when we profiled memory consumption, the eagerness turned out to be the cause.
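To make the eager/lazy distinction concrete, here is a minimal sketch in Python rather than .NET, purely for illustration. The names `WindowView`, `eager_windows` and `lazy_windows` are hypothetical and are not part of Deedle or Spreads. It assumes the backing data is immutable (a tuple here), which is what makes sharing it between views safe.

```python
from collections.abc import Sequence


class WindowView(Sequence):
    """Read-only view over a range of a shared, immutable backing tuple.

    Hypothetical stand-in for something like .NET's ArraySegment.
    """

    def __init__(self, data, start, length):
        self._data = data      # shared backing store, never copied
        self._start = start
        self._length = length

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if isinstance(i, slice):
            raise TypeError("slicing is not supported in this sketch")
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError(i)
        return self._data[self._start + i]


def eager_windows(data, size):
    """Materialize every window as its own tuple: O(n * size) memory."""
    return [data[i:i + size] for i in range(len(data) - size + 1)]


def lazy_windows(data, size):
    """Yield one view at a time: O(1) extra memory per window."""
    for i in range(len(data) - size + 1):
        yield WindowView(data, i, size)


# Usage: moving sums without copying any window contents.
data = tuple(range(10))
moving_sums = [sum(w) for w in lazy_windows(data, 3)]
```

With n elements and window size k, the eager version holds about n * k copied values at once, while the lazy version only ever holds the original data plus one small view object.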

Another general issue is that I have really tried to fix Deedle and to understand what is going on, but it is too complex internally and tries to be purely functional when it doesn't need to be (Series are already immutable, so everything else internally could be mutable, imperative and therefore fast). I hope that I do not sound too harsh on all these issues; that is definitely not my intent - I like Deedle, am still learning from it, and use it, and these issues come from real-world usage. But following the Pandas implementation (and there is a Word .doc about this: Deedle was modeled after Pandas) is probably not a good idea on the much more advanced .NET platform, given that the Pandas library is just a Python interface to C code underneath.

buybackoff commented 8 years ago

I have repeated some tests on Mono and Deedle is 10x faster there on a virtual box compared to .NET on the same machine. Details are here.