fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

Support for the Feather format #343

evelinag opened this issue 8 years ago (status: open)

evelinag commented 8 years ago

Feather is a recently introduced fast binary format for storing data frames. It is language agnostic and can currently be used to load data frames into R and Python. It would be great to have support for this format in Deedle as well, to allow exchanging data with R and Python code.

For more information see: blog.rstudio.org/2016/03/29/feather
Feather source code: github.com/wesm/feather
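
For comparison, here is a minimal sketch of how frames are typically exchanged with R and Python today, using Deedle's existing CSV API; Feather would replace this slow, type-lossy text round trip with a fast binary one. The `Frame.ReadFeather`/`SaveFeather` names mentioned in the comments are hypothetical, used only as an analogy to the existing CSV methods.

```fsharp
// A minimal sketch, assuming only Deedle's existing CSV API: today the
// interchange with R/Python goes through text, which loses type precision
// and is slow for large frames -- exactly the gap Feather targets.
open Deedle

let df = Frame.ReadCsv("input.csv")      // existing Deedle API
df.SaveCsv("for_r_and_python.csv")       // text round trip, no binary column types

// A Feather-backed API could mirror this shape, e.g. a hypothetical
// Frame.ReadFeather / df.SaveFeather pair -- these do not exist today.
```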

adamklein commented 8 years ago

Great idea!

buybackoff commented 8 years ago

It would be cool to reuse your FlatBuffers project to automatically map .NET's primitive types and structs (and maybe POCOs) to the Arrow format, since Arrow uses FlatBuffers as well, and then keep .NET<->Arrow as a reusable module and build Feather on top of it. Feather is too specific to data frames, while the Arrow format could also be used for chunk/block storage. I have been investigating for a while how to adapt it for the Spreads library, and I am very interested in a .NET port. What do you think would be easier/more feasible: a C interface with P/Invoke, or a native rewrite in F#/C#?
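
To make the two options concrete, the P/Invoke route would look roughly like the sketch below. The extern names are hypothetical: the reference Feather implementation is C++, so a thin C shim exporting functions along these lines would be needed before .NET could call it. The alternative, a managed rewrite, is noted in the trailing comment.

```fsharp
// Hypothetical sketch of the P/Invoke option. None of these entry points
// exist today -- they stand for a small C shim over the C++ Feather library.
open System.Runtime.InteropServices

module FeatherNative =
    [<DllImport("feather_c", CallingConvention = CallingConvention.Cdecl)>]
    extern nativeint feather_reader_open(string path)

    [<DllImport("feather_c", CallingConvention = CallingConvention.Cdecl)>]
    extern int64 feather_reader_num_rows(nativeint reader)

    [<DllImport("feather_c", CallingConvention = CallingConvention.Cdecl)>]
    extern void feather_reader_close(nativeint reader)

// The native-rewrite option would instead parse the Feather/Arrow metadata
// (a FlatBuffers footer) with the FlatBuffers .NET library and read the
// column buffers with managed code -- no native dependency at all.
```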

pkese commented 5 years ago

@buybackoff
Do you have any idea how https://github.com/kevin-montrose/FeatherDotNet would fit into Deedle's internals?
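
Whatever reader is used, the integration question above largely reduces to a column-bridging step: get plain .NET arrays out of the Feather file and build a Deedle frame from them. A sketch under that assumption, with a placeholder reader rather than FeatherDotNet's actual API:

```fsharp
// readFeatherColumn is a placeholder, NOT FeatherDotNet's API: any Feather
// reader that hands back plain arrays per column can feed Deedle this way.
open Deedle

// placeholder: a real implementation would decode a Feather column here
let readFeatherColumn (path: string) (name: string) : float[] =
    [| 1.0; 2.0; 3.0 |]

// build a Deedle frame from named columns using the public Deedle API
let loadFeatherAsFrame (path: string) (columnNames: string list) =
    columnNames
    |> List.map (fun name -> name => Series.ofValues (readFeatherColumn path name))
    |> Frame.ofColumns

// usage: let df = loadFeatherAsFrame "data.feather" [ "price"; "volume" ]
```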

buybackoff commented 5 years ago

@pkese I'm not the one to talk about Deedle internals, but

My current take is that the physical binary layout doesn't matter much; there is no silver bullet, but I'm biased. I'm doing well with SQLite and LMDB, storing data blocks as just shuffled+compressed blobs. SQLite is damn fast: SSD write speed is the limit when writing moderately sized chunks. LMDB is much faster for reads. Zstd compression often makes IO faster, because the savings in data size and read/write time outweigh the CPU spent on (de)compression.

In the end it is just blobs with headers laid out sequentially, with some indexing. Anything will do many orders of magnitude better than CSV/JSON. Arrow is more like well-specified common sense than something unique; what is unique is that the very big Apache ecosystem has agreed on that standard.
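
A minimal sketch of the shuffle+compress idea above, assuming nothing beyond the BCL: the column's bytes are transposed so that all first bytes come first, all second bytes next, and so on, which makes slowly varying numeric data far more compressible. Brotli from System.IO.Compression stands in for Zstd here, which would need a separate package.

```fsharp
// Byte-shuffle a float64 column, then compress the shuffled bytes.
// Neighbouring values share byte patterns after shuffling, so general
// compressors do much better on the transposed layout.
open System
open System.IO
open System.IO.Compression

let shuffleBytes (values: float[]) : byte[] =
    let width = sizeof<float>                       // 8 bytes per element
    let raw = Array.zeroCreate<byte> (values.Length * width)
    Buffer.BlockCopy(values, 0, raw, 0, raw.Length)
    let shuffled = Array.zeroCreate<byte> raw.Length
    for i in 0 .. values.Length - 1 do
        for b in 0 .. width - 1 do
            // byte b of element i goes into byte plane b
            shuffled.[b * values.Length + i] <- raw.[i * width + b]
    shuffled

let compress (bytes: byte[]) : byte[] =
    use output = new MemoryStream()
    // Brotli as a stand-in for Zstd; dispose flushes the final block
    using (new BrotliStream(output, CompressionLevel.Fastest, true)) (fun brotli ->
        brotli.Write(bytes, 0, bytes.Length))
    output.ToArray()

// usage: a slowly varying series shuffles+compresses far smaller than raw
let blob = [| 0.0 .. 0.001 .. 1000.0 |] |> shuffleBytes |> compress
```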

Please sign up for announcements here if you are interested in very fast persistence for real-time data streams, series, matrices and frames. I have it partially working in a private repo and hope to release it soon for general use. I will implement ML.NET's IDataView rather than Arrow on top of my very simple physical layout, which conceptually resembles Arrow a lot.
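
For readers who haven't met it, IDataView is ML.NET's schema-aware, lazily evaluated table abstraction; a storage engine like the one described would implement IDataView/DataViewRowCursor directly over its own blobs. The sketch below only shows the consumer-facing surface, using LoadFromEnumerable as a stand-in, and the Tick type is an invented example.

```fsharp
// What "exposing data through IDataView" looks like from the consumer side.
open System
open Microsoft.ML

[<CLIMutable>]
type Tick = { Time: DateTime; Price: float32 }

let ctx = MLContext()

let ticks =
    [ { Time = DateTime(2019, 1, 1); Price = 100.0f }
      { Time = DateTime(2019, 1, 2); Price = 101.5f } ]

// IDataView is the schema-aware table abstraction that any loader
// (Parquet, a custom blob store, ...) can present to ML.NET.
let dataView : IDataView = ctx.Data.LoadFromEnumerable(ticks)

for col in dataView.Schema do
    printfn "%s : %O" col.Name col.Type
```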

buybackoff commented 5 years ago

Relevant issue in ML.NET: https://github.com/dotnet/machinelearning/issues/1860

ML.NET already has a Parquet loader: https://github.com/dotnet/machinelearning/tree/master/src/Microsoft.ML.Parquet.

And now we have Feather, Arrow, and Parquet; you get to learn how they differ, or whether they are just different names/implementations of the same thing... And then comes IDataView, which promises to standardize all the standards. The xkcd link above is so relevant here :)