fstpackage / fstlib

A C++ library for lightning fast multi-threaded serialization of tabular data. Home to the `fst` file format.
Mozilla Public License 2.0

Looking forward to full description of fst format #3

Open xiaodaigh opened 6 years ago

xiaodaigh commented 6 years ago

I know it's going to be a bit of work, but a full description of the fst format will help build connectors into it from Julia, Python, and any other programming language. The potential is huge for such an awesome on-disk data manipulation framework!

I will try to help when I know enough C++. I secretly hope that once the format is well known, there can be independent implementations in Julia and Rust (at the risk of running out of sync with C++), as native implementations would be fun. But calling into C++ is also a good option.

MarcusKlik commented 6 years ago

Hi @xiaodaigh, thanks! Yes, I definitely need to spend time on documenting the format and, perhaps more importantly, the fstlib API, so new connectors can be built!

It's not complicated, but the API will grow as computational features are added (which will run in parallel with the file IO). Providing for methods that can only be run on the master thread (such as R methods) will also have to be reflected in the API. Perhaps Rust, with its better concurrency, could provide a faster connector for fstlib; that would be very interesting!

Just a question: why would you prefer a native implementation in Rust or Julia over calling the fstlib library from a Rust or Julia wrapper? Julia especially will probably take a performance hit if used for the low-level operations that fstlib requires. Or are you referring to a native binding instead of a binding through the R-Julia interface package?

Is there an example package in Julia which could be used to model a native binding, a package using a simple C++ library for example? The binding could be made to a C or C++ API to fstlib using packages like Clang.jl or Cpp.jl, Cxx or CxxWrap perhaps? Starting a toy package early would certainly help a lot to create a uniform API that's suitable for different languages!

xiaodaigh commented 6 years ago

Anyway, the first thing I would do is to use Cxx.jl to call into fstlib. But I might experiment with a pure Julia implementation at some point given the fst format is stable.

Julia has some low-level control as well, but not good multi-threading at the moment. I think fstlib is good for scripting languages like Julia, R, and Python, so it would be nice to actually write it in a scripting language as well. Given the format is stable, a pure Julia implementation will allow Julia programmers to contribute, not just those with C++ knowledge. But it's overall better to have all resources contribute to one library, in this case a C++ one in fstlib; I wish I knew enough C++ to contribute. Learning...

Once the multi-threading story is better in Julia and there is better interop between R-Julia and Python-Julia, then you may be tempted to switch to Julia as well as the syntax is nice and simple, and it can be as fast as C/C++ in many cases.

MarcusKlik commented 6 years ago

Hi @xiaodaigh, thanks, I think using Cxx.jl would be a nice solution where you only have a single code-base. It would be hard to keep versioning and new features in sync across two distinct libraries in different languages (and it would cost a lot of time, currently the most valuable resource for fstlib development :-)).

I would be very interested in trying to set up a fst package in Julia, please let me know if and how I can help with that!

davidanthoff commented 6 years ago

In general, is there a chance that fstlib might expose a pure C API, not a C++? That would make integration in other languages a lot easier.

E.g. for Julia, Cxx.jl is great, but at this point installation is so tricky that it is really not an option for a widely used package. On the other hand, if fstlib just exposed a C API, one could integrate it super easily into Julia.
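To make the request concrete, here is a minimal sketch of the usual pattern for putting a C API in front of C++ code: an opaque handle plus `extern "C"` functions that never let an exception escape. Everything named `DemoTable` or `demo_table_*` is hypothetical and only illustrates the shape; none of it is part of fstlib.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical C++ class standing in for a library type; NOT the real fstlib API.
class DemoTable {
public:
    explicit DemoTable(std::string path) : path_(std::move(path)) {}
    void add_int_column(std::vector<std::int32_t> values) {
        columns_.push_back(std::move(values));
    }
    std::size_t row_count() const {
        return columns_.empty() ? 0 : columns_.front().size();
    }
private:
    std::string path_;
    std::vector<std::vector<std::int32_t>> columns_;
};

extern "C" {

// Opaque handle: C callers only ever hold a pointer to an incomplete type.
typedef struct demo_table demo_table;

// Returns nullptr on failure; exceptions must not cross the C boundary.
demo_table* demo_table_create(const char* path) {
    if (path == nullptr) return nullptr;
    try {
        return reinterpret_cast<demo_table*>(new DemoTable(path));
    } catch (...) {
        return nullptr;
    }
}

// Returns 0 on success and -1 on error (C-style status code).
int demo_table_add_int_column(demo_table* handle, const std::int32_t* values,
                              std::uint64_t n) {
    if (handle == nullptr || (values == nullptr && n > 0)) return -1;
    try {
        reinterpret_cast<DemoTable*>(handle)->add_int_column(
            std::vector<std::int32_t>(values, values + n));
        return 0;
    } catch (...) {
        return -1;
    }
}

std::uint64_t demo_table_row_count(const demo_table* handle) {
    if (handle == nullptr) return 0;
    return reinterpret_cast<const DemoTable*>(handle)->row_count();
}

// Safe to call with nullptr, like free().
void demo_table_close(demo_table* handle) {
    delete reinterpret_cast<DemoTable*>(handle);
}

}  // extern "C"
```

A layer like this only uses plain C types in its signatures, so in principle Julia could reach it with `ccall` and Python with `ctypes`, without needing a C++-aware binding tool.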

MarcusKlik commented 6 years ago

Hi @davidanthoff, thanks for your question. Basically, the fst package in R also has a C-only interface when looked at from the R side (that's all R understands), so that's similar to your request. In R, the Rcpp package is used for convenience, and one of the things it does is generate a C interface that can be used by R. Those C wrappers then call the underlying C++ code from fstlib. Would it be possible to have a setup like that for Julia?

For a full implementation of fstlib in Julia, you would need:

These are all abstract classes which would need an implementation based on the Julia API. So you should be able to have access to the Julia API from the DLL.

The reason for that is that fstlib is a zero-copy library. So any data structure (such as columns) needed to hold data should be created directly in Julia and not copied from an existing memory buffer. That reduces memory requirements and increases the speed.

Perhaps when you have a basic setup, I could assist you in implementing the abstract classes for Julia. It would be very interesting to see an implementation of fstlib in other languages than C++ and R!
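As a rough illustration of the zero-copy idea described above, the sketch below has the library ask a host-provided factory for the column storage and then decode straight into it, so no intermediate buffer is ever copied. The interface names (`HostIntColumn`, `HostColumnFactory`) and the delta decoding are assumptions made for the example; they are not the actual fstlib abstract classes or its on-disk encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical abstract interfaces that a host language binding would implement.
class HostIntColumn {
public:
    virtual ~HostIntColumn() = default;
    virtual std::int32_t* data() = 0;      // storage owned by the host
    virtual std::size_t size() const = 0;
};

class HostColumnFactory {
public:
    virtual ~HostColumnFactory() = default;
    virtual std::unique_ptr<HostIntColumn> make_int_column(std::size_t n) = 0;
};

// "Library side": decodes delta-encoded values directly into host storage.
// Delta encoding is only a stand-in for whatever decoding a real reader does.
std::unique_ptr<HostIntColumn> decode_deltas_into_host(
        HostColumnFactory& factory, const std::int32_t* deltas, std::size_t n) {
    std::unique_ptr<HostIntColumn> column = factory.make_int_column(n);
    std::int32_t* out = column->data();
    std::int32_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += deltas[i];
        out[i] = running;  // written once, straight into the host's buffer
    }
    return column;
}

// "Host side": a std::vector plays the role of a Julia or R vector here.
class VectorIntColumn : public HostIntColumn {
public:
    explicit VectorIntColumn(std::size_t n) : values_(n) {}
    std::int32_t* data() override { return values_.data(); }
    std::size_t size() const override { return values_.size(); }
private:
    std::vector<std::int32_t> values_;
};

class VectorColumnFactory : public HostColumnFactory {
public:
    std::unique_ptr<HostIntColumn> make_int_column(std::size_t n) override {
        ++allocations;  // counts how often the host was asked for storage
        return std::unique_ptr<HostIntColumn>(new VectorIntColumn(n));
    }
    int allocations = 0;
};
```

In a real binding, the factory would presumably call into the host runtime to allocate a native array, which is why the compiled library needs access to the host's API.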

xiaodaigh commented 6 years ago

Basically, for a little bit of context, we have shown via benchmarking that fst has the fastest read/write speed in the Julia/R/Python-verse. Parquet and R's serialization are the only other major ones we haven't tested.

So I would be extremely keen to be able to use fst in Julia.

davidanthoff commented 6 years ago

To be fair, you didn’t measure Feather perf with the R or Python packages, those might be faster than the Julia implementation (or not, who knows).

MarcusKlik commented 6 years ago

Hi @xiaodaigh and @davidanthoff, that's great to hear. It would be nice to compare the various serialization options with a wide range of parameters. For example, for fstlib, the speed depends on a lot of factors:

Testing many systems is very labor intensive, but it would be very interesting to set up a benchmark that uses generated samples with various characteristics:

That way we could really learn about the strong and weak points of different serializers and how they relate to each other. Are your benchmarks published somewhere (or do you have plans for that)?

thanks!
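A small sketch of what such a generated-sample benchmark could look like: a reproducible column generator and a timing helper. The two characteristics varied here (row count and number of distinct values) are my own picks for illustration, and `generate_column` and `time_seconds` are hypothetical helpers, not part of fstlib or any existing benchmark suite.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Generates `rows` integers drawn uniformly from [0, distinct).
// A fixed seed keeps the sample reproducible across runs and machines.
std::vector<std::int32_t> generate_column(std::size_t rows,
                                          std::int32_t distinct,
                                          std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::int32_t> dist(0, distinct - 1);
    std::vector<std::int32_t> column(rows);
    for (std::int32_t& value : column) {
        value = dist(rng);
    }
    return column;
}

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A harness would wrap each serializer's write and read calls in `time_seconds` and loop over a grid of `rows` and `distinct` values, ideally repeating each measurement several times.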

xiaodaigh commented 6 years ago

Obviously that is going to be a lot of work. I think ultimately we can set up a website where people can submit benchmarks from their system by running some Julia and/or R code. For now I am slowly adding benchmarking code to the DataBench.jl repo.

MarcusKlik commented 6 years ago

Hi @davidanthoff, on your question about a Julia implementation. Perhaps it would be possible to create a package using small steps:

After milestone 2, we know that we can call the Julia API from the compiled library, which means we can implement the abstract classes from fstlib.

Would that be doable? If any special code is necessary to accommodate the Julia API, I can provide that from the fstlib library (for example, some API calls might only be allowed from the master thread like in R).
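One common way to handle "master thread only" API calls is a task queue: worker threads post the calls they need, and the master thread runs them when it is safe to do so. The sketch below shows that pattern under the assumption that the master thread is the one that constructs the queue; `MasterThreadQueue` is a hypothetical name and this is not how fstlib itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Collects tasks from any thread and executes them only on the owning thread.
class MasterThreadQueue {
public:
    // The constructing thread is treated as the master thread (an assumption).
    MasterThreadQueue() : owner_(std::this_thread::get_id()) {}

    // May be called from any thread, e.g. a worker that needs a host API call.
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Must be called from the master thread; returns the number of tasks run.
    std::size_t drain() {
        assert(std::this_thread::get_id() == owner_);
        std::size_t executed = 0;
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (tasks_.empty()) break;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so tasks can post follow-up work
            ++executed;
        }
        return executed;
    }

private:
    std::thread::id owner_;
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};
```

Worker threads would call `post` with, for example, a closure that allocates a host vector, while the master thread calls `drain` periodically as it waits for the workers to finish.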

xiaodaigh commented 5 years ago

Milestone 1 can be easily achieved; see https://github.com/JuliaInterop/CxxWrap.jl

I don't know anything about C++ and that's the issue. I want to help here, and I traced the code to _fst_fstretrieve for reading a fst file, but I can't figure out how to go any further.

What would help is someone familiar with C++ doing this, but if it's me, I need some specific directions on how to compile fstlib into a .so file and which C++ functions I can call in this manner.

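#include <string>
#include "jlcxx/jlcxx.hpp"

// Example function to expose to Julia.
std::string greet() { return "hello, world"; }

// Register greet() so Julia can call it through CxxWrap.jl.
JLCXX_MODULE define_julia_module(jlcxx::Module& mod)
{
  mod.method("greet", &greet);
}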