fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Compare and contrast with parquet #129

Open xiaodaigh opened 6 years ago

xiaodaigh commented 6 years ago

It's also on-disk and allows for compression. I wonder what the similarities and differences are.

MarcusKlik commented 6 years ago

Hi @xiaodaigh, thanks for a very interesting question!

Parquet and Arrow are two Apache projects designed to work efficiently with columnar data formats.

The fstlib library (and with it the fst package) has a different goal, although it shares some design principles with Arrow and Parquet:

The philosophy behind fstlib is to make the most of the computer component that is currently evolving fastest (by far): the SSD. The gains in speed, access times and storage capacity are huge, and with 3D XPoint memory maturing and stacked 3D V-NAND taking over, that growth is far from over.

With such fast storage devices, it becomes feasible to work on very big datasets using nothing more than a fast storage device and a consumer-grade computer.

For example, you can store a 10-column by 10-billion-row integer table in a ~50 GB (compressed) fst file. With the latest SSDs, you can read such a file completely in less than half a minute. Most computers don't have the RAM to hold that, but you can read pieces of the table: with a random-access format like fst, reading a single column takes only a few seconds. That's fast enough to do real data science computations on that table. If you do your calculations on background threads while reading (chunks of) data, working with that table remains very fast, even faster than first importing the data into an in-memory structure and only then computing on it with a separate library.
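
As an illustration, here is a minimal sketch of that random-access pattern using the fst package's read/write API (the file name, table size and column names are made up for the example):

```r
library(fst)

# write a compressed fst file once; a small stand-in for the
# multi-billion row table described above
df <- data.frame(id = 1:1e6, value = rnorm(1e6))
write_fst(df, "big_table.fst", compress = 50)

# read a single column without touching the rest of the file
ids <- read_fst("big_table.fst", columns = "id")

# random access: read only rows 1000 to 2000
chunk <- read_fst("big_table.fst", from = 1000, to = 2000)
```

Because the format is random access, the cost of each read scales with the slice you request, not with the size of the file on disk.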

So that's the practical goal of fst: work on a multi-billion row data table from your mid-range laptop using nothing more than a fast disk (or 3 NVMe disks in RAID 0, your choice :-)). That will open up working with 'big data' (or better, 'large data') to many data scientists who don't have (or want) access to large systems!

xiaodaigh commented 6 years ago

Yes. I am writing a blog post analysing the Fannie Mae data (1.8 billion rows) with disk.frame, using fst as the back-end; a rough sketch of that workflow is below. It will be an awesome showcase of fst. If this is packaged with better large-database connectivity and distributed computing, there is a lot of commercial value. There is a company in this!
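
Roughly, that kind of workflow looks like this (a hedged sketch using the disk.frame API; the CSV path, output folder and the `status` column are hypothetical):

```r
library(disk.frame)
library(dplyr)

# use several background workers for chunk-parallel evaluation
setup_disk.frame(workers = 4)

# convert a large CSV into a disk.frame: a folder of chunked fst files
loans <- csv_to_disk.frame("loans.csv", outdir = "loans.df")

# dplyr verbs run chunk-by-chunk over the underlying fst files;
# collect() materialises the filtered result in memory
delinquent <- loans %>%
  filter(status == "delinquent") %>%
  collect()
```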

MarcusKlik commented 6 years ago

Thanks and very interesting showcase, please make sure you post a reference once it is finished!

xiaodaigh commented 5 years ago

I have created a vignette, but it's not yet the one that deals with the Fannie Mae data.

MarcusKlik commented 5 years ago

Hi @xiaodaigh, I've already seen a lot of activity on your package :-) Great work, disk.frame seems very feature-rich already. Are you close to releasing on CRAN?

I'm going to study your vignette; it looks very nice already.

Thanks for the heads-up!