fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

any plans for python/julia interfaces #184

Open jangorecki opened 5 years ago

jangorecki commented 5 years ago

Are there any plans to make interfaces from other languages to your binary format? With a Python or Julia interface we could easily move data between different platforms. That is what feather was meant to do, but it is slow and crashes in R even for a 500 MB CSV input.

MarcusKlik commented 5 years ago

Hi @jangorecki, thanks for asking, the answer is yes, absolutely!

The underlying lib that powers fst is called fstlib (and is available here). It compiles on all major platforms, and I recently updated the Travis builds to also include Windows.

To add a new client language, a wrapper for fstlib has to be created and the fstlib API needs to be implemented (mainly to create and delete memory and map the specific types to native types).

I don't have much experience using Julia, but creating a new wrapper is very important I think (any help in that department would be much appreciated :-))

By the way, @xiaodaigh created a wrapper for the fst package in Julia (see here). So that package is not a direct implementation of the fstlib library but rather a wrapper around the R package.

(the same holds true for Python. Getting a package out is high on the priority list.)

randomgambit commented 5 years ago

hello there! as you may have noticed, I think there is a need for an efficient storage format that works with R and Python. Do you have a timeline for the fst python bindings?? Happy to do some testing if needed.

Thanks!!!

MarcusKlik commented 5 years ago

Hi @randomgambit, thanks for the heads-up! Yes, there seems to be a void between R and Python that could be filled nicely with fst bindings for Python I think. My plan is to get a Python package operational before the end of this year.

Your offer to help with testing is much appreciated!

randomgambit commented 5 years ago

before the end of this year.

I really wish you said before the end of the month instead!! :D

MarcusKlik commented 5 years ago

Ha @randomgambit, yes, the same here! If only I had more time... I'll make sure to talk to my 'day-time-job' director on your behalf :-)

asgr commented 4 years ago

Just curious if there is any update on the progress on the Python side? I use R, but a lot of people in my research field (astronomy) use Python. Feather seems the current best bet in this regard, but FST seems like it could be a decent step up given its subsetting and compression capabilities.

MarcusKlik commented 4 years ago

Hi @asgr,

thanks for your question. Yes, the Python bindings are long overdue and the fst format could be a faster and more dynamic bridge between R and Python than feather.

The python bindings could basically follow the same strategy as the r bindings: the fstlib library generates 1D numpy arrays from the stored column data. And those arrays can be wrapped into a pandas data frame.

I'll try to get a repository up and running soon with an initial package version and we can work from there (user input would be much appreciated :-)).

jangorecki commented 4 years ago

pydt would be useful too, fyi @st-pasha

MarcusKlik commented 4 years ago

that sounds like a great default return type for python's read_fst() 😺

st-pasha commented 4 years ago

@MarcusKlik Do you have any documentation for the fst file format?

MarcusKlik commented 4 years ago

Hi @st-pasha, thanks for asking, are you interested in a specification of the format meta-data, data-block design, etc or the C++ API documentation?

(neither is readily available at the moment, but just to know where to direct my efforts 😸)

xiaodaigh commented 4 years ago

format meta-data, data-block design for me as I am writing a Julia serializer

MarcusKlik commented 4 years ago

Ha @xiaodaigh, that's great to hear. I suspect that you won't need the exact details of the fstlib implementation and fst format (but you will definitely need good API documentation)

(that is, unless you mean you'd like to write your own format; in that case the format specs are of interest, of course)

The fstlib library has an abstract representation (C++) of a table and its columns, and it will take some effort to write an implementation for that using the Julia C/C++ API and internal data layout.

Please let me know if I can help you there, implementing a Julia binding will be a very good test of the flexibility of fstlib :-)

(see also this issue in fstlib)

st-pasha commented 4 years ago

Hi @MarcusKlik , sorry I should have given more context for my question.

So, I'm a primary developer of the Python datatable library. This library provides a data frame object and facilities to manipulate this data frame. So, I guess it's pretty close to fstlib in functionality. We also have our own format for storing data on disk, called Jay.

Some time in the near future (maybe around winter) we were planning to add integrations for other on-disk data formats, foremost arrow (feather) and parquet. And, as @jangorecki points out, the fst format is another good candidate to consider.

In other words, I'm not looking to use fstlib itself, just to add the ability to read (and possibly write) fst files produced by some source (say, a file can be written in R and then read in Python by datatable). This is, of course, conditional on whether you'd be ok with making your file format open for 3rd-party libraries to implement and use, especially if those other libraries are not GPL-based.

So, if this all sounds agreeable to you, I would be looking for a document describing how to interpret data stored in a .fst file. Something similar to our Jay format description linked above.

MarcusKlik commented 4 years ago

Hi @st-pasha, no problem!

I'm very familiar with your work on (py)datatable (big fan ;-)) and just wanted to get clear how you would integrate fst into the package.

In short, fstlib is similar to parquet and feather libraries; it contains the code to wrap an existing data structure (e.g. a (py)datatable) and serialize that to disk (or RAM in a future upgrade).

So it was explicitly not designed to manipulate in-memory datasets, like datatable and pandas (and arrow).

What the fstlib will be able to do is to run custom functions while loading data from disk. So during a load, each chunk can be processed on main- or background threads. This will enable fast calculations on on-disk data. But the actual methods used for these calculations will be provided by the user.

This is a difference with the goals of arrow, for example, because arrow does provide code for operations on its internal dataframe structure. So it aims to be a universal dataframe manipulation framework that can be used from different languages and systems.

With fstlib, calculations are done by the client, leveraging the strong points of specific languages and their functions.

The fst format is tightly bound to the fstlib library, as data-blocks and meta-data are compressed using optimized algorithms that are (only) available in fstlib. For example, compression usually involves a bit-shifting filter to speed up results. This filter is part of the fstlib library.

For datatable to add the fst format to its reading and writing capabilities, fstlib will have to be compiled and integrated with datatable. Then, a (zero copy) wrapper for a datatable object can be created so that fstlib can (de-)serialize data to disk.

Currently, fstlib has an AGPL-3.0 license, and the LZ4 and ZSTD compression libraries have their own licenses (BSD). So that cannot easily be re-licensed to datatable's MPL-2.0 at the moment (I think). One option is to create a package fst for Python that returns a datatable, which would give a separation of licenses. Another option would be to create a special license for use by datatable.

please let me know what you think, thanks!

st-pasha commented 4 years ago

Hi Marcus,

Based on your description it looks like the fst format is sufficiently complicated that it doesn't make sense to create an independent reader. In that case the simplest solution would be to have a separate fst library wrapping the fstlib.

Then in datatable we could have simple wrappers such as

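class Frame:
    def to_fst(self, path):
        import fst  # imported lazily, so fst stays an optional dependency
        fst.write(self, path)

def fread(path):
    if path.endswith(".fst"):
        import fst
        # return the frame to the caller instead of discarding it
        return fst.read(path, output_format="datatable")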

We also have a feature proposal (https://github.com/h2oai/datatable/issues/1950) for implementing xread() which reads data + performs computations on that data at the same time. We will need to think how to integrate this with fst properly.

For now, however, there are 2 main questions:

  1. How can the fst package create a datatable Frame? We'll have to add an API function into datatable for that, which is not that hard. Ultimately, a Frame is just a list of named columns, and all we need is to understand what notion of a "column" fstlib exposes. Specifically, we'll need to know how fst encodes NAs, string data (including non-ascii), datetime objects, etc.

  2. How can the fst package read an existing datatable Frame? We already have an API for accessing a frame's raw data, but that works for "material" data only. Generally, datatable supports "virtual" (computed) columns too, and I wonder whether fst can be made to read those columns directly without materializing them?

MarcusKlik commented 4 years ago

Hi @st-pasha,

thanks, that sounds excellent. On your 2 main questions:

  1. The (py)fst library implements virtual C++ classes from fstlib. These (relatively simple) virtual classes include a table factory and column factory. The implemented (py)fst C++ classes allow creation of Frames and the correct columns. So the details of mapping specific column types from the fst format to Python are contained in the (py)fst package. Conversion between the different representations of NAs will be handled in the fstlib library, however, as that is a cross-language problem (same with strings). At first, I think the creation of columns and Tables can be done by calling Python code from the C++ side. The overhead should be minimal because there are only relatively few of these calls for each read.

  2. The (py)fst library also has to implement a virtual class to represent a Frame. That C++ implementation includes member functions to access the underlying raw data and these can perfectly use the existing datatable API to access that.
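
To make the two points above a bit more concrete, here is a minimal Python sketch of the factory idea. All names in it (ColumnFactory, TableFactory, ListColumnFactory, DictTableFactory, read_table) are hypothetical and do not correspond to the actual fstlib C++ API; plain lists and a dict stand in for datatable columns and Frames. The only point is that the reader asks client-supplied factories to allocate the columns and assemble the table.

```python
from abc import ABC, abstractmethod


class ColumnFactory(ABC):
    """Hypothetical interface: the client decides how a column is allocated."""

    @abstractmethod
    def create_column(self, type_name, length):
        ...


class TableFactory(ABC):
    """Hypothetical interface: the client decides how columns form a table."""

    @abstractmethod
    def create_table(self, names, columns):
        ...


class ListColumnFactory(ColumnFactory):
    # Stand-in client: plain Python lists instead of datatable columns.
    _defaults = {"int": 0, "float": 0.0, "str": ""}

    def create_column(self, type_name, length):
        return [self._defaults[type_name]] * length


class DictTableFactory(TableFactory):
    # Stand-in client: a dict of name -> column instead of a Frame.
    def create_table(self, names, columns):
        return dict(zip(names, columns))


def read_table(schema, stored, column_factory, table_factory):
    """Stand-in for the reader: fill client-allocated columns with stored data.

    schema is a list of (name, type_name) pairs; stored maps name -> values.
    """
    names, columns = [], []
    for name, type_name in schema:
        values = stored[name]
        column = column_factory.create_column(type_name, len(values))
        column[:] = values  # the reader writes into memory owned by the client
        names.append(name)
        columns.append(column)
    return table_factory.create_table(names, columns)


schema = [("id", "int"), ("label", "str")]
stored = {"id": [1, 2, 3], "label": ["a", "b", "c"]}
frame = read_table(schema, stored, ListColumnFactory(), DictTableFactory())
print(frame)
```

Swapping in a different pair of factories would change the returned object type without touching read_table, which is the separation described above.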

There are currently a few virtual columns in the fst format, but only for boundary cases like a factor column with just a single factor (which can be represented by a few numbers). Columns like sequences from n to m will also be encoded in dense format later on. Virtual columns would be a tremendous enhancement to that and I would very much like to see how we can support that. The challenge is to provide a cross-language way of encoding common expressions and constants. Virtual columns that depend on other virtual columns should also be possible. Does datatable have such a universal implementation?

Interestingly, on the R side, virtual columns can be implemented using the new ALTREP framework that was released with R 3.5.

Your xread() proposal is very interesting and something similar is planned for fst. The idea is that during reading, additional transforms can be added to the processing. Currently fst does read -> decompress -> bit-shift for each (16 kB) data-block, but additional transformations can be added like row-selection, or (custom-) functions. The plan is to restrict these transforms to the main thread and do the reading and decompression on the background threads. That way, the user can call native R or Python methods and not get into trouble with memory management. This only works for methods that have a map-reduce like implementation (think sum(), min(), max()). Other methods might only be applicable when full columns are read first (think median(), my_custom_func()). The same applies to reads where a by argument is selected, except for sorted tables.
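
The difference between map-reduce style methods and methods that need the full column can be sketched in Python. This is an illustration only: iter_blocks is a hypothetical stand-in for block-wise reading, and the block size of 4 values is arbitrary rather than the 16 kB mentioned above.

```python
import statistics


def iter_blocks(values, block_size):
    """Hypothetical stand-in for reading a column one data-block at a time."""
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]


def blockwise_summary(blocks):
    """Combine per-block partial results; the full column is never held."""
    total, lowest, highest = 0, None, None
    for block in blocks:
        total += sum(block)
        block_min, block_max = min(block), max(block)
        lowest = block_min if lowest is None else min(lowest, block_min)
        highest = block_max if highest is None else max(highest, block_max)
    return total, lowest, highest


column = [7, 3, 9, 1, 4, 8, 2, 6, 5]

# sum(), min() and max() only need a small running state per block.
total, lowest, highest = blockwise_summary(iter_blocks(column, block_size=4))

# median() has no simple per-block combination step, so the blocks are
# concatenated into a full column before it is applied.
full_column = [v for block in iter_blocks(column, block_size=4) for v in block]
middle = statistics.median(full_column)

print(total, lowest, highest, middle)
```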

So, bottom line, the setup is very similar to the setup used for the fst package in R. We will need implementations of interfaces for factories, virtual tables and virtual columns in a (py)fst package. That package has a dependency on datatable as the implementations require constructors and the API of datatable. Seen from datatable the impact is very small; methods like the ones you posted above are probably sufficient.

Thanks!

st-pasha commented 4 years ago

Hi Marcus,

I presume you have much more experience with developing R libraries than Python extensions, so let me point out a few peculiarities of Python that could be relevant to the design process.

  1. Python has only one native list type: a list of objects (or more precisely, a PyList of PyObjects). Unlike in R, there are no native types for "list of ints", or "list of strings", etc. The closest alternative that we have is a numpy array, or a pandas series, or a datatable frame, or arrow dataframe -- all of them keep their data in C structures, exposing to python only a "frontend" object that marshals all methods to the backend implementation.

  2. For the same reason, calling native python functions for data transformations would not work: you'd need numpy methods, or datatable methods, or arrow methods, etc.

  3. Since we want to create a native C object from another C library, this calls for a C API between the libraries. Luckily, Python supports this use case via so-called capsule objects. This is neat, because you'd only need a single .h file when compiling your library, and they'll be linked dynamically at runtime. In fact, fst wouldn't even need to list datatable as a dependency: it can attempt to import the functionality at runtime, and then fail with a graceful error if the user doesn't have the module installed.
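
The optional dependency in point 3 can be sketched at the Python level as follows. The helper name load_optional_backend is hypothetical; a real extension would additionally fetch the other library's C API through a capsule (for example with PyCapsule_Import on the C side), which is not shown here.

```python
import importlib


def load_optional_backend(module_name):
    """Import a backend at call time; fail with a readable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"output format {module_name!r} was requested, but the "
            f"{module_name!r} module is not installed"
        ) from exc


# An installed module is imported normally ...
json_module = load_optional_backend("json")

# ... while a missing one produces the graceful error instead of a crash.
try:
    load_optional_backend("hypothetical_missing_backend_xyz")
except RuntimeError as err:
    message = str(err)
    print(message)
```

Because the import happens inside the function, the package itself can be installed and used without the optional module; the error only appears when that output format is actually requested.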


Virtual columns that depend on other virtual columns should also be possible. Does datatable have such a universal implementation?

Virtual columns are a new functionality in datatable, their implementation is largely complete, though there's still some refactoring to do to make sure the existing code uses the new functionality to full extent. And yes, in our design a "virtual column" is an object that knows how to calculate its i-th element. For example, a binary_plus column could look like this:

template <typename T> class binary_plus : public ColumnImpl {
  Column lhs, rhs;
  bool get_element(size_t i, T* out) {  // returns `isna` flag
    T x, y;
    bool lhs_isna = lhs.get_element(i, &x);
    if (lhs_isna) return true;
    bool rhs_isna = rhs.get_element(i, &y);
    if (rhs_isna) return true;
    *out = x + y;
    return false;
  }
};
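
For readers less familiar with C++, the same idea can be sketched in Python, including a virtual column that depends on another virtual column. The class names are illustrative only and are not datatable's actual classes; None plays the role of NA here, whereas the C++ version returns a separate isna flag.

```python
class MaterialColumn:
    """Column backed by stored values; None marks a missing value (NA)."""

    def __init__(self, values):
        self.values = values

    def __len__(self):
        return len(self.values)

    def get_element(self, i):
        return self.values[i]


class BinaryPlus:
    """Virtual column: element i is computed on request, never stored."""

    def __init__(self, lhs, rhs):
        if len(lhs) != len(rhs):
            raise ValueError("both operands must have the same length")
        self.lhs, self.rhs = lhs, rhs

    def __len__(self):
        return len(self.lhs)

    def get_element(self, i):
        x = self.lhs.get_element(i)
        if x is None:
            return None  # NA on the left propagates
        y = self.rhs.get_element(i)
        if y is None:
            return None  # NA on the right propagates
        return x + y


a = MaterialColumn([1, 2, None])
b = MaterialColumn([10, None, 30])
c = MaterialColumn([100, 200, 300])

# A virtual column may itself be an operand of another virtual column.
nested = BinaryPlus(BinaryPlus(a, c), b)
result = [nested.get_element(i) for i in range(len(nested))]
print(result)
```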

MarcusKlik commented 4 years ago

Hi @st-pasha,

thanks for the pointers! And yes, your assumption is very correct :-)

About your second point, could we:

  1. Let the fst.read() method materialize Frames from subsets of the stored data.
  2. Do transformations on those subsets.
  3. Combine the transformed Frames into a single larger frame.

That way, wouldn't it be possible to use datatable operations (from the python API) to do the transformations in step 2?

Or, perhaps simpler, when column A is being transformed, column B can be read into memory on background threads. When that's finished, column B can be transformed while column C is being read, etc.
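
This second idea, reading the next column in the background while the current one is being transformed, can be sketched as below. read_column and transform are hypothetical placeholders, the data is made up, and a single worker thread stands in for the reader's background threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up stored data, standing in for columns in a file.
STORED = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}


def read_column(name):
    """Hypothetical placeholder for reading and decompressing one column."""
    return list(STORED[name])


def transform(values):
    """Hypothetical placeholder for a user-supplied transformation."""
    return [v * 2 for v in values]


def read_and_transform(names):
    results = {}
    if not names:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_column, names[0])
        for position, name in enumerate(names):
            values = pending.result()  # wait for the current column
            if position + 1 < len(names):
                # Start reading the next column before transforming this one.
                pending = pool.submit(read_column, names[position + 1])
            results[name] = transform(values)  # runs on the main thread
    return results


frame = read_and_transform(["A", "B", "C"])
print(frame)
```

Keeping the transformation on the main thread matches the plan described earlier in the thread; how many worker threads each library gets is exactly the tuning question raised below.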

Obviously, we would have datatable and fst competing for thread resources, so we need some way of tuning that...

thanks!

ssh352 commented 4 years ago

In my team some use R and others use python, so we had to use hdf5 because fst is only for R. But I like fst better.

MarcusKlik commented 4 years ago

Hi @ssh352,

thanks, I'm happy to hear that fst works for you and your team! Yes, the python bindings are an important next step for the fst format, and the goal is to roll out a package in 2020.

Hope you and your team can wait for that :-)

chasemc commented 4 years ago

Just wanted to express that being able to read in Python would be extremely useful :)

richierocks commented 3 years ago

This issue has been quiet for a while. Has any progress been made with Python access to fst? (I'm very excited for this feature!)

MarcusKlik commented 3 years ago

Hi @richierocks, thanks for checking in on the progress. Unfortunately, I haven't had much time to work on a python package (as you have noticed :-)).

I do think the python bindings are very important, it's just that time is a real bottleneck here. The package will probably have to wait until later in 2021, apologies for that!

jangorecki commented 3 years ago

This SO answer can be improved once fst is available in python: https://stackoverflow.com/a/64880745/2490497

MarcusKlik commented 3 years ago

Hi @jangorecki, thanks for the heads up, when fst has a python interface, I will make sure to add the timings to your SO answer!

Maaxion commented 2 years ago

Has there been any progress on the project to create a python interface for fstlib?

Roleren commented 3 months ago

Hey, just wanted to give you a heads-up that we in the genomics community are starting to experiment with this format; it is a very powerful substitute for older formats like bam files. I do believe that if you do not have time to fix this yourself, funding could be acquired through grants etc. Supervised master students could also do project courses to implement simpler/smaller parts. Many possibilities here, let me know if this could be of interest.