fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

Series and Frames for real-time streaming data #51

Closed. buybackoff closed this issue 8 years ago

buybackoff commented 11 years ago

What would be the right way to use Series in a real-time environment where new data arrive asynchronously?

I have found a question (and probably a part of an answer) that describes exactly the idea. http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

The answers on SO suggest using the FSharpx.Collections.Vector<T> data structure instead of arrays. Another answer (http://stackoverflow.com/a/19520214/801189) on SO by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is on fixed, existing series and frames, a workflow much like R's. But when the data length is fixed, performance matters less than in a real-time environment.

For streaming data we need to append new value(s) to an existing series and then use the new series. With the current array implementation, that requires copying the whole old array into a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus 8 bytes for DateTime), or around 80 MB per instrument. With e.g. 100 instruments, copying all arrays many times per second is probably not the best option.
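To make the estimate concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (5 million points, 16 bytes per point, 100 instruments); the helper `elements_copied_by_naive_append` is a hypothetical name, and it assumes the worst case where every single append copies the whole existing array.

```python
# Back-of-the-envelope estimate using the figures quoted above.
POINTS_PER_DAY = 5_000_000   # data points per instrument per day
BYTES_PER_POINT = 8 + 8      # 8-byte double value + 8-byte DateTime key
INSTRUMENTS = 100

bytes_per_instrument = POINTS_PER_DAY * BYTES_PER_POINT
bytes_total = bytes_per_instrument * INSTRUMENTS

print(bytes_per_instrument / 1e6, "MB per instrument")
print(bytes_total / 1e9, "GB across all instruments")


def elements_copied_by_naive_append(n):
    """Elements copied when n points are appended one at a time and each
    append copies the whole existing array (assumed worst case).
    The k-th append copies k - 1 elements, so the total is
    0 + 1 + ... + (n - 1) = n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(elements_copied_by_naive_append(POINTS_PER_DAY), "elements copied per instrument per day")
```

The quadratic growth in copied elements is the real concern here, more than the 80 MB itself.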

Simplest use case: for stock price A with a 1-second interval, we calculate a 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.

  1. For a new price point we create a new series object by appending to the old object (for a very large data set, copying the array is slow)
  2. Then we take the last 60 values from the new series object and calculate the new MA value (the crucial point is to avoid recalculating all MA values and to read only the last 60-point window from A)
  3. We append the new MA value to MA_A_60.
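The three steps above can be sketched roughly as follows. This is not Deedle code: it uses plain mutable Python lists as a stand-in for the series, and the names `MovingAverageFeed` and `on_new_price` are made up for illustration. The point is only that each update touches the last `window` points and nothing else.

```python
class MovingAverageFeed:
    """Hypothetical stand-in for series A and MA_A_60; not a Deedle API."""

    def __init__(self, window=60):
        self.window = window
        self.prices = []   # (timestamp, price) pairs, stands in for series A
        self.ma = []       # (timestamp, average) pairs, stands in for MA_A_60

    def on_new_price(self, timestamp, price):
        # Step 1: append in place; amortized O(1), no copy of the full history.
        self.prices.append((timestamp, price))
        if len(self.prices) < self.window:
            return None    # not enough points for a full window yet
        # Step 2: only the last `window` points are read.
        tail = self.prices[-self.window:]
        average = sum(p for _, p in tail) / self.window
        # Step 3: append the new MA value.
        self.ma.append((timestamp, average))
        return average


# Small window so the result is easy to check by hand.
feed = MovingAverageFeed(window=3)
for t, price in enumerate([1.0, 2.0, 3.0, 4.0]):
    feed.on_new_price(t, price)
print(feed.ma)   # [(2, 2.0), (3, 3.0)]
```

A mutable list sidesteps the copying problem but gives up immutability, which is exactly the trade-off the question is about.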

Will the current implementation be suitable for such a workflow with hundreds of instruments, multiple calculated values for each one, and sub-second frequency?

Will an implementation of Deedle's IVector with FSharpx.Collections.Vector be more suitable for such a use case? (I know one should run some tests in a situation like this, but there is no second implementation to compare with)

I would love to have Deedle's abstraction and API for such a use case!

P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A, and add the new values to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA using the transformation function f. In turn, there could be some seriesC = f2 seriesB, and so on. I am not sure that this functionality should be inside the library, but that is what I hope to achieve.
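A minimal sketch of that projection idea, again in plain Python rather than Deedle. `LiveSeries` and `project` are hypothetical names, and the sketch assumes f receives the source series and returns the value for the newest key (or None to emit nothing). A projection only reacts to points appended after it was linked.

```python
class LiveSeries:
    """Hypothetical append-only series that notifies linked projections."""

    def __init__(self):
        self.keys = []
        self.values = []
        self._subscribers = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        # Propagate the new point down the chain of linked series.
        for notify in self._subscribers:
            notify(key)

    def project(self, f):
        """Return a series kept in sync with this one.

        f(source) is called once per new point and computes the value for
        the newest key, typically from the tail of source; returning None
        means no value is emitted for that key.
        """
        target = LiveSeries()

        def notify(key):
            result = f(self)
            if result is not None:
                target.append(key, result)

        self._subscribers.append(notify)
        return target


series_a = LiveSeries()
# seriesB = f seriesA: mean of the last two points, once two exist.
series_b = series_a.project(
    lambda s: sum(s.values[-2:]) / 2 if len(s.values) >= 2 else None
)
# seriesC = f2 seriesB: chained projection on top of B.
series_c = series_b.project(lambda s: s.values[-1] * 10)

for key, value in [(1, 10.0), (2, 20.0), (3, 40.0)]:
    series_a.append(key, value)

print(series_b.values)   # [15.0, 30.0]
print(series_c.values)   # [150.0, 300.0]
```

Each append to A does a constant amount of work per linked series, regardless of how long the history is.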

hmansell commented 10 years ago

My guess is that the current implementation would not be suitable - but please feel free to try it!

As you say, copying everything doesn't make sense for this application. Ideally, you would want changes to propagate down the chain of operations and do calculations incrementally, which would require different abstractions.

We're working on some real-time stuff at BlueMountain and going about it quite a different way.

sirinath commented 9 years ago

Is it possible to consider open sourcing the real time stuff you guys have done internally?

hmansell commented 9 years ago

The implementation is too coupled to our internal infrastructure to allow us to do that, unfortunately.

buybackoff commented 8 years ago

Done.

buybackoff commented 8 years ago

Here it is: https://github.com/Spreads/Spreads

sirinath commented 8 years ago

WOW. Good stuff.

Also, can you make the license more permissive?