fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Some questions regarding creating fst files from C++ #139

Open statquant opened 6 years ago

statquant commented 6 years ago

Dear @MarcusKlik, first I'd like to thank you for providing fst to the R community. This is a truly amazing package, and I can see it becoming for its use case what data.table and Rcpp are for theirs. This is more a question (several, actually) than an issue.

I'd like to start generating fst files directly in C++ from my own C++ application; by this I mean the output of my C++ code would be fst files rather than flat files. Would that be possible in the current state of the package? I have not found any documentation about fst (I confess I might not have spent enough time looking); would you have some pointers to get me started?

Lastly, I see that fst is licensed under AGPL-3.0. Would it be agreeable for me to use it in a corporate project that would not be sold nor shared with any other client/party?

Many thanks

MarcusKlik commented 6 years ago

Hi @statquant, thanks for your kind words; I'm pleased to read that fst works for you!

You can definitely generate fst files from your own C++ application. For that you need to link against the C++ core library of fst, called fstlib, which is published here. Both fst and fstlib are licensed under AGPL-3.0, which means that you can use them for your internal corporate projects (I would be very interested to learn more about them, though :-)). The license is a copyleft license, so if you want to build a product around it, you can also do that (and sell it if you want), but the source code should be provided so that the work can benefit the open source community again.

(If that's too restrictive, or you want to use fstlib in a proprietary software product, please let me know so I can provide a custom license as required.)

Because fst is currently the main consumer of the fstlib library, I have spent most of my time on fst, and much of the documentation for fstlib still has to be written. But the API is quite straightforward: the consumer (your C++ program) has to implement several interfaces (abstract classes) that provide access to the data in your own table classes. A simple sample (C++) implementation is provided, which is also used to run the unit tests.

The reason for the abstract classes is that fstlib uses a zero-copy strategy for speed. Instead of copying your model of a table into fstlib's model of a table, the underlying data is accessed directly. In R, for example, the implementations of the abstract classes provide direct (no-copy) access to the in-memory data of a data.frame object (or equivalent).
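To make the zero-copy idea concrete, a minimal sketch of the pattern looks like this (the names below are illustrative only; the actual abstract classes and their exact signatures are in the lib/fst/interface headers):

```cpp
#include <cstdint>
#include <vector>

// Illustrative column interface in the spirit of fstlib's abstract classes:
// the writer asks the consumer for a raw pointer instead of copying the data
// into an intermediate table model.
class IDoubleColumnSketch
{
public:
  virtual ~IDoubleColumnSketch() = default;
  virtual double* Data() = 0;         // direct pointer into consumer storage
  virtual uint64_t Size() const = 0;  // number of elements
};

// Consumer-side implementation: simply expose the application's own buffer.
class MyDoubleColumn : public IDoubleColumnSketch
{
public:
  explicit MyDoubleColumn(std::vector<double>& values) : values_(values) {}
  double* Data() override { return values_.data(); }          // zero-copy access
  uint64_t Size() const override { return values_.size(); }   // no duplication
private:
  std::vector<double>& values_;  // reference to existing application data
};
```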

If you want to set up a project using fstlib, I'll be happy to help where I can; perhaps we can use this issue as a starting point.

thanks!

statquant commented 6 years ago

Hello @MarcusKlik, I'd be very happy to contact you offline and explain everything I plan to use fst for in detail. What is the best way to contact you? As for https://github.com/fstpackage/fstlib/issues/5, that's perfect; I am happy I can contribute, even if very modestly.

twest820 commented 3 years ago

Hi @MarcusKlik, similar to @statquant, I'm considering generating fst files directly, just from .NET rather than C++. Am I correct in thinking fstlib/tree/master/lib/fst/interface would be the interface to wrap (e.g. with a corresponding set of classes in C++/CLI) for fstlib 0.1.1? I don't see notes about this in fstlib#5 or fstlib#9.

This isn't a big deal for us (our data flow currently involves remote .csv creation, and we can run R on the creating machines to transcode .csv to .fst, getting an order of magnitude reduction in the bytes that have to move across the network), but it would be nice to have for simplicity.

MarcusKlik commented 3 years ago

Hi @twest820, thanks for your question! Yes, you're right: an implementation in C++/CLI would involve implementing the interfaces defined in lib/fst/interface, but unfortunately the documentation for that is very limited. Do you have a custom (.NET) representation of in-memory tables for your application, or is there a commonly used default representation of columnar datasets in .NET? It would definitely be nice to have .NET bindings for fst built to work with tables in .NET through CLI.

twest820 commented 3 years ago

Hi @MarcusKlik, you're asking, in effect, about System.Data.DataTable. However, I hesitate over how congruent ADO.NET (System.Data) is with the sort of scenarios where people want to use fst, as it tends to be oriented towards business recordkeeping uses of SQL and seems less prevalent in the data science, business intelligence, and scientific computing that's likely to integrate with R. A lot of data-generating code is likely to write to a FileStream, or a wrapping TextWriter, from application-specific data structures. Those structures might be entirely in memory, as R tables are (or possibly paged), but it's probably about as common to stream lines to a file.

For apps with in-memory storage, the most natural integration is probably implementing IFstTable on whatever class is the top-level storage. I'll need to take a more thorough look than just skimming the code on GitHub. For the streaming model, I'm not seeing that FstStreamer is hooked up or that FstChunk has an implementation in fstlib, but that might be the integration point if it does get implemented.

fstpackage commented 3 years ago

Hi @twest820, yes, from what I can see, System.Data.DataTable stores its data row-wise in memory, so it wouldn't really fit a columnar storage format like fst (the overhead would be very large). As you say, implementing IFstTable would probably be the best solution; does your internal data structure have a columnar layout?

Are you thinking of using the fst format primarily to reduce network load? That is, generating (heavily compressed) chunks of data and sending those over the line to a client application?

twest820 commented 3 years ago

Hi Marcus, I've been having a look at the fstlib sources, and there are a number of things to consider; I'll try to unpack the most significant of them here. First, to answer your questions: our usage is fairly typical cloud compute jobbing. The typical data lifecycle is download of .fst files from the compute machine to an analysis machine, many reads of those files as analysis proceeds, and upload of the files important enough to retain to long-term cloud backup. So we care about network throughput on download, in order to quickly get new candidate files to analysis machines, as well as about compensating for slower upload rates to cloud stores. But we also care about the total footprint of the files in storage and their read rates. For now, the default .fst compression level is fine: it gives an order of magnitude size reduction from .csv, an order of magnitude speedup over data.table::fread(), and two orders of magnitude over readr::read_csv(). While I'll likely do more probing of the optimal compression level over time, my preliminary look at non-default levels indicated only incremental improvements.

Where I see more potential is in structural changes. .NET code implementing structure-of-arrays layouts for SIMD is very likely to use Array or List<T>. Given a C++/CLI wrapper equivalent to fsttable, these translate to the Data() members of fstlib's I{type}Column family via pin_ptr. There is casting, lifecycle bookkeeping, and some other things, but the main point is that fstlib can consume managed memory directly, without transcoding, when the data types match. With fstlib 0.1.1, this nominally works for the epu8 (char or unsigned byte), epi32 (int, for C++ compilers with 32-bit interpretations of int), epi64 (long long), and pd (double) SIMD types. It's not supported for the epi8 (signed byte), epi16 (short int), epu16 (unsigned short int), epu32 (unsigned long int), epu64 (unsigned long long), and ps (float) types. There are also potential interoperability issues with .fst files, since fstlib is not distributed as a binary: different builds of it may use 16-bit integers or disagree about endianness, depending on the compiler used. That's not something I've looked into as yet.
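For reference, the pinning step I have in mind looks roughly like this; consumeNative() and WriteColumn() are hypothetical stand-ins for the wrapper code, not fstlib API:

```cpp
// C++/CLI sketch (compiled with /clr) of handing managed memory to native code.

// Hypothetical stand-in for the native wrapper that would pass the pointer
// on to fstlib's column interfaces.
static void consumeNative(const double* data, int length)
{
    (void)data;    // ... the native fstlib write path would read 'data' in place ...
    (void)length;
}

void WriteColumn(array<double>^ managed)
{
    if (managed->Length == 0)
        return;
    pin_ptr<double> pinned = &managed[0];    // blocks GC relocation while pinned
    consumeNative(pinned, managed->Length);  // zero-copy handoff to native code
}   // pin released when 'pinned' goes out of scope
```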

The main thing for me is that nearly all of our actual data is single precision, so I'll have to inflate it to double precision to present it to fstlib as an IDoubleColumn. Since FstStore::fstWrite() consumes an entire table at once, this requires a triple memory footprint during writes (float data + double copy = 3x the float data size). So I suspect there would be a noticeable decrease in file size and an improvement in read and write speeds if there were a way to generate .fst files with IFloatColumns and then have those inflated to double at read_fst() time to keep base R happy. (Being able to round-trip them back to float after manipulation in R looks like a very low priority for us.)
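The widening step itself is trivial; the cost is entirely in the transient footprint. A sketch, with widenForWrite() as a hypothetical helper name:

```cpp
#include <vector>

// Inflate float data to double so it can be presented as an IDoubleColumn.
// While both buffers are alive, peak memory is 1x (float) + 2x (double) = 3x
// the size of the original float data.
std::vector<double> widenForWrite(const std::vector<float>& floatData)
{
    return std::vector<double>(floatData.begin(), floatData.end());
}
```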

The other major structural opportunity for us is that we usually have about as many factor columns in our .fst files as we do float columns. In many cases the factors are constant or change only infrequently, so it's likely much of the size advantage we get from .fst comes from run-length compression on these columns. Generating .fst files directly allows me to move unchanging factor columns into metadata for (probably) some additional size reduction and reflate them into tibbles in our functions which wrap fst_read(). There are also cases where I anticipate getting some shrinkage by converting things currently written to .csv as strings into factors. In all of our cases where factor values do change, IFactorColumn's native type of int is overkill and a byte would suffice. I can probably implement that through metadata, IByteColumn, and mutate(foo = factor(foo, levels = ...)) as a workaround for IFactorColumn's width, along the lines sketched below.
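The byte-backed representation would be something like this; ByteFactor is a hypothetical structure of mine, not an fstlib type:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical byte-backed factor, workable for factors with at most 256
// levels: the level strings travel once in file metadata, while the column
// itself carries one byte per row (via IByteColumn) instead of an int.
struct ByteFactor
{
    std::vector<std::string> levels;  // e.g. {"low", "medium", "high"} -> metadata
    std::vector<uint8_t> codes;       // per-row index into 'levels'
};
```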

Formally, there's perhaps also some potential for improvement from explicitly run-length encoded columns. For example, we might have a table with 230,000 rows: 10,000 rows each for 23 valid combinations of a factor with four levels and a factor with five levels. So, for one factor, run-length encoding looks like "repeat this byte for 10,000 rows"; for the other it's "repeat this byte for 40,000-60,000 rows". I suspect the main advantages will be in using metadata and dropping from int to byte for the columns we do have to carry, but more flexible factor support would be a little cleaner than manipulating column types and putting the associated factor levels in metadata, as files would remain self-documenting. (We don't have scenarios for them, but ints seem like overkill for ILogicalColumn too; the compilers I know offhand all store boolean types as bytes.)
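For illustration, this kind of column reduces to (value, run length) pairs; nothing in fstlib currently exposes this, so the sketch below is purely my own:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Minimal run-length encoding of a byte column: each pair holds a value and
// the number of consecutive repeats. In the 230,000-row example above, one
// factor column collapses to 23 pairs and the other to a handful of pairs.
std::vector<std::pair<uint8_t, uint32_t>> runLengthEncode(const std::vector<uint8_t>& column)
{
    std::vector<std::pair<uint8_t, uint32_t>> runs;
    for (uint8_t value : column)
    {
        if (!runs.empty() && runs.back().first == value)
            ++runs.back().second;          // extend the current run
        else
            runs.emplace_back(value, 1u);  // start a new run
    }
    return runs;
}
```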

The last design consideration I have at the moment is that we currently append to .csv files. For example: run at 100% CPU for several hours, stream that set of results to .csv in case the machine or process crashes, and repeat until done. With a move to .fst this probably turns into writing incremental shards and then doing a merge pass at the end of a successful run. There are some subtleties there, since fstlib uses OpenMP rather than task-level parallelism. I suspect this just means a ReaderWriterLockSlim around setting OpenMP to single-threaded authoring of shards and then enabling multithreading for shard union operations (see the sketch below). However, I haven't written any tests for the C++/CLI version of OpenMPHelper yet.
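The native half of that guard might look like the following; SingleThreadScope is a hypothetical helper, and whether fstlib's threading actually responds to omp_set_num_threads() this way is something I still need to verify:

```cpp
#include <omp.h>

// Hypothetical RAII guard: drop OpenMP to one thread while authoring a shard,
// then restore the previous thread count for multithreaded merge passes.
// The .NET side would hold a ReaderWriterLockSlim around uses of this scope.
class SingleThreadScope
{
public:
    SingleThreadScope() : previous_(omp_get_max_threads())
    {
        omp_set_num_threads(1);          // serialize shard writes
    }
    ~SingleThreadScope()
    {
        omp_set_num_threads(previous_);  // re-enable multithreading
    }
private:
    int previous_;
};
```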

I do anticipate that aspects of all this will change based on what I'm able to accomplish with the current fstlib 0.1.1 build. Replacing .csv with direct .fst generation, rather than our current practice of generating .csv and transcoding to .fst at the end of runs, is a side project. So it's low priority and, as such, I don't presently have an ETA for when I might get it operable.