Benchmark on well-known data sets

MarcusKlik commented 7 years ago

And add a section on benchmarking on https://fstpackage.github.io

MarcusKlik commented 7 years ago

I believe the best way to approach an extensive benchmark suite would be to publish code that generates a number of datasets with different characteristics. Each code snippet should generate a single-column dataset. Examples of datasets that differ significantly in the resulting serialization speed:

1) random integer column
2) sequential integer column (e.g. 1:1000000)
3) integer column with many NA values
4) integer column in a limited range (e.g. all values between -100 and 100)
5) only positive random integers

6) doubles generated with runif
7) doubles related to monetary values (e.g. generated with sample(1:x, n) / 100)
8) double column that has only a limited number of distinct values

9) character column with a limited number of distinct values (e.g. 'TRUE', 'FALSE', 'NA')
10) character column with short / medium / long strings
11) character vector with special UTF-8 characters

12) logicals with 90 percent TRUE and 10 percent FALSE
13) random logicals

and many more. Performance varies a lot between all of these types. When measuring the compression and serialization performance of a particular piece of software, the first question that should be answered is 'what data are you actually compressing / serializing?'. Standard 'text-oriented' benchmark datasets like the Silesia compression corpus are not very relevant to data science and would not accurately depict the performance of packages like fst, feather or data.table (fread / fwrite).
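As a rough illustration of the idea above, here is a minimal R sketch of such single-column generators. It is not an agreed-upon suite: the generator names, the row count, the value ranges, and the `measure()` helper are all assumptions made for this example, and base R's `saveRDS()` is used only as a stand-in serializer so the snippet runs without extra packages.

```r
set.seed(42)  # fixed seed so the generated columns are reproducible
n <- 1e4      # rows per dataset (arbitrary; a real benchmark would use far more)

# Hypothetical helper: n random lowercase strings of a fixed length
rand_strings <- function(n, len) {
  chars <- matrix(sample(letters, n * len, replace = TRUE), nrow = n)
  apply(chars, 1, paste, collapse = "")
}

# One generator per column type from the list above (names are illustrative)
generators <- list(
  int_random = function(n) {
    sample.int(.Machine$integer.max, n, replace = TRUE) *
      sample(c(-1L, 1L), n, replace = TRUE)
  },
  int_sequential = function(n) seq_len(n),
  int_many_na = function(n) {
    x <- sample.int(1000L, n, replace = TRUE)
    x[sample.int(n, n %/% 2)] <- NA_integer_  # half of the values become NA
    x
  },
  int_small_range  = function(n) sample(-100:100, n, replace = TRUE),
  int_positive     = function(n) sample.int(.Machine$integer.max, n, replace = TRUE),
  dbl_uniform      = function(n) runif(n),
  dbl_monetary     = function(n) sample(1:1e6, n, replace = TRUE) / 100,
  dbl_few_distinct = function(n) sample(c(0.5, 1.5, 2.5), n, replace = TRUE),
  chr_few_distinct = function(n) sample(c("TRUE", "FALSE", "NA"), n, replace = TRUE),
  chr_short        = function(n) rand_strings(n, 3),
  chr_long         = function(n) rand_strings(n, 100),
  chr_utf8         = function(n) {
    sample(c("caf\u00e9", "\u00fcber", "\u4e2d\u6587", "\u03a9"), n, replace = TRUE)
  },
  lgl_skewed = function(n) sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.9, 0.1)),
  lgl_random = function(n) sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Each dataset is a data.frame with a single column named x
datasets <- lapply(generators, function(gen) {
  data.frame(x = gen(n), stringsAsFactors = FALSE)
})

# Stand-in measurement: time one write and record the file size.
# saveRDS() is an assumption for this sketch, not the serializer under test.
measure <- function(df) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  elapsed <- system.time(saveRDS(df, path))[["elapsed"]]
  c(write_seconds = elapsed,
    file_bytes    = file.size(path),
    memory_bytes  = as.numeric(object.size(df)))
}

results <- t(sapply(datasets, measure))  # one row per dataset
print(results)
```

To benchmark a specific package, the `saveRDS()` call inside `measure()` could be swapped for that package's writer (for example `fst::write_fst(df, path)`), and each measurement repeated several times, since a single timing on a small dataset is noisy.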

xiaodaigh commented 7 years ago

I will definitely use fst to construct the benchmark suite that I am building for Julia!!

https://github.com/xiaodaigh/data_manipulation_benchmarks

MarcusKlik commented 7 years ago

Hi @xiaodaigh, it would be great to have a fst port for Julia (and Python). The core of fst is now C++ only, so it should be straightforward to write a wrapper for other platforms. For example, I'm using a pure C++ wrapper around fst for testing purposes.

But at the moment I'm concentrating on getting the fst file format stable and ready for future expansions (like data hashes, key tables, and row and column binding), so I won't be spending time on ports to other languages just yet.

Your benchmark suite looks very interesting, and it would be great to have fst compared with other (cross-language) packages. If you need any help with that, please let me know!

fstpackage / fst

Benchmark on well-known data sets #5