xiaodaigh opened 6 years ago
Hi @xiaodaigh, thanks for a very interesting question!
Parquet and Arrow are two Apache projects designed for efficient work with columnar data formats.
The fstlib library (and with it the fst package) has a different goal, although it shares some of the design principles of Arrow and Parquet: fstlib aims to provide a framework for working with on-disk data using as little memory and disk space as possible, while keeping processing performance high. It does that by utilizing compression, random access and multi-threading as efficiently as possible. In effect, it's Parquet and Arrow combined in a single package. That's an important difference: with fst there is no distinction between an in-memory data structure and an on-disk data structure, they are one and the same. With fstlib you will be able to perform calculations directly on the compressed fst format, and that structure can be in-memory or on-disk; the two will be 100 percent identical. The goal is to perform efficient operations directly on small pieces of data from that structure, selectively unpacking and processing them one by one. fstlib will provide an API to unpack and work with these pieces (or complete columns) in the native memory structures of R, for example (zero copy, like Arrow).
The philosophy behind fstlib is to make the most of the computer component that is currently evolving fastest (by far): the SSD drive. The gains in speed, access times and storage capacity are huge, and with 3D XPoint memory maturing and stacked 3D V-NAND taking over, that growth is far from over.
With such fast storage devices, it becomes feasible to work on very big datasets by using nothing more than a fast storage device and a consumer grade computer.
For example, you can store a 10-column by 10-billion-row integer table in a ~50 GB (compressed) fst file. With the latest SSDs, you can read such a file completely in less than half a minute. However, most computers don't have the RAM to support that. But you can read pieces of the table: reading a single column from that table will only take a few seconds (with a random access format like fst). That's fast enough to do real data science computations on that table. If you do your calculations on different (background) threads while reading (chunks of) data, working with that table will still be very fast, even faster than first importing the data into an in-memory structure and only then calculating on it using a separate library.
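The random-access pattern described above can be sketched with fst's documented `write_fst()`/`read_fst()` API. This is a minimal illustration, not code from the thread; the file name and data sizes are made up, and a real use case would involve far larger tables:

```r
# Sketch of fst's random access: write a compressed file, then read
# only a single column and a row range, without loading the whole table.
library(fst)

# Illustrative table (a real use case would be billions of rows)
df <- data.frame(id = 1:1000000, value = runif(1000000))

# Write with compression (compress ranges 0-100; 50 is the default)
write_fst(df, "table.fst", compress = 50)

# Selectively read one column, rows 1000 to 2000 (inclusive)
chunk <- read_fst("table.fst", columns = "value", from = 1000, to = 2000)
nrow(chunk)  # 1001 rows read; the rest of the file stays on disk
```

Because the format supports per-column, per-chunk access, only the requested bytes are decompressed, which is what makes out-of-memory workflows like the one described here feasible.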
So that's an example of the practical goal of fst: work on a multi-billion-row data table from your mid-range laptop using nothing more than a fast disk (or three NVMe disks in RAID 0, your choice :-)). That will open up working with 'big data' (or better, 'large data') to many data scientists who don't have (or want) access to large systems!
Yes. I am writing a blog post analysing the Fannie Mae data (1.8 billion rows) with disk.frame using fst as the back-end. It will be an awesome showcase of fst.
If this were packaged with better large-database connectivity and distributed computing, there would be a lot of commercial value. There is a company in this!
Thanks and very interesting showcase, please make sure you post a reference once it is finished!
I have created a vignette but it's not quite the one that deals with Fannie Mae data yet.
Hi @xiaodaigh, I already saw a lot of activity on your package :-)
Great work, disk.frame seems very feature-rich already. Are you close to releasing on CRAN?
I'm going to study your vignette, looks very nice already.
Thanks for the heads-up!
It's also on-disk and allows for compression. I wonder what the similarities and differences are.