xiaodaigh opened 6 years ago
Hi @xiaodaigh, thanks for a very interesting question!
Parquet and Arrow are two Apache projects designed for efficient work with columnar data formats.
The fstlib library (and with it the fst package) has a different goal, although it shares some of the design principles of Arrow and Parquet: fstlib aims to provide a framework for working with on-disk data using as little memory and disk space as possible, while keeping processing performance high. It does that by utilizing compression, random access and multi-threading as efficiently as possible. In effect, it's Parquet and Arrow combined in a single package. That's an important difference: with fst there is no distinction between an in-memory data structure and an on-disk data structure, they are one and the same. With fstlib you will be able to perform calculations directly on the compressed fst format, and that structure can be in-memory or on-disk; the two will be 100 percent identical. The goal is to perform efficient operations directly on small pieces of data from that structure, selectively unpacking and processing them one by one. fstlib will provide an API to unpack and work with these pieces (or complete columns) in the native memory structures of R, for example (zero copy, like Arrow).
The philosophy behind fstlib is to make the most of the computer component that is currently evolving fastest (by far): the SSD drive. The gains in speed, access times and storage capacity are huge, and with 3D XPoint memory maturing and stacked 3D V-NAND taking over, that growth is far from over.
With such fast storage devices, it becomes feasible to work on very big datasets by using nothing more than a fast storage device and a consumer grade computer.
For example, you can store a 10-column by 10-billion-row integer table in a ~50 GB (compressed) fst file. With the latest SSDs, you can read such a file completely in less than half a minute. However, most computers don't have the RAM to support that. But you can read pieces of the table: reading a single column from that table will only take a few seconds (with a random access format like fst). That's fast enough to do real data science computations on that table. If you do your calculations on different (background) threads while reading (chunks of) data, working with that table will still be very fast, even faster than first importing the data into an in-memory structure and only then calculating on it using a separate library.
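The random-access pattern described above can be sketched with fst's documented `write_fst()`/`read_fst()` API. This is a minimal illustration, not code from the thread; the file name and data sizes are made up, and a real use case would involve far larger tables:

```r
# Sketch of fst's random access: write a compressed file, then read
# only a single column and a row range, without loading the whole table.
library(fst)

# Illustrative table (a real use case would be billions of rows)
df <- data.frame(id = 1:1000000, value = runif(1000000))

# Write with compression (compress ranges 0-100; 50 is the default)
write_fst(df, "table.fst", compress = 50)

# Selectively read one column, rows 1000 to 2000 (inclusive)
chunk <- read_fst("table.fst", columns = "value", from = 1000, to = 2000)
nrow(chunk)  # 1001 rows read; the rest of the file stays on disk
```

Because the format supports per-column, per-chunk access, only the requested bytes are decompressed, which is what makes out-of-memory workflows like the one described here feasible.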
So that's an example of the practical goal of fst: work on a multi-billion-row data table from your mid-range laptop using nothing more than a fast disk (or three NVMe disks in RAID 0, your choice :-)). That will open up working with 'big data' (or better, 'large data') to many data scientists who don't have (or want) access to large systems!
Yes. I am writing a blog post analysing the Fannie Mae data (1.8 billion rows) with disk.frame using fst as the back-end. It will be an awesome showcase of fst.
If this were packaged with better large-database connectivity and distributed computing, there would be a lot of commercial value. There is a company in this!
Thanks and very interesting showcase, please make sure you post a reference once it is finished!
I have created a vignette but it's not quite the one that deals with Fannie Mae data yet.
Hi @xiaodaigh, I already saw a lot of activity on your package :-)
Great work, disk.frame seems very feature-rich already. Are you close to releasing on CRAN?
I'm going to study your vignette, looks very nice already.
Thanks for the heads-up!
It's also on-disk and allows for compression. I wonder what the similarities and differences are.