MarcusKlik opened 7 years ago
I believe the best way to approach an extensive benchmark suite would be to publish code that generates a number of datasets differing in their characteristics. Each code snippet should generate a single-column dataset. Examples of datasets that differ significantly in the resulting serialization speed:
1) random integer column
2) sequential integer column (e.g. 1:1000000)
3) integer column with many NA values
4) integer column in a limited range (e.g. all values between -100 and 100)
5) only positive random integers
6) doubles generated with runif
7) doubles related to monetary values (e.g. generated with sample(1:x, n) / 100)
8) double column with only a limited number of distinct values
9) character column with a limited number of distinct values (e.g. 'TRUE', 'FALSE', 'NA')
10) character column with short / medium / long strings
11) character vector with special UTF-8 characters
12) logicals with 90 percent TRUE and 10 percent FALSE
13) random logicals
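A few of the dataset types above could be sketched in R as follows (the column sizes, value ranges, and variable names are illustrative choices, not part of any fixed suite):

```r
# Illustrative single-column dataset generators (assumed parameters)
n <- 1e6L

random_int     <- sample.int(.Machine$integer.max, n, replace = TRUE)  # 1) random integers
sequential_int <- 1:n                                                  # 2) sequential integers
int_with_na    <- local({                                              # 3) many NA values
  x <- sample.int(1000L, n, replace = TRUE)
  x[sample.int(n, n %/% 2L)] <- NA_integer_
  x
})
limited_range  <- sample(-100:100, n, replace = TRUE)                  # 4) limited range
uniform_dbl    <- runif(n)                                             # 6) doubles from runif
monetary_dbl   <- sample.int(100000L, n, replace = TRUE) / 100         # 7) monetary-style doubles
few_levels     <- sample(c("TRUE", "FALSE", "NA"), n, replace = TRUE)  # 9) few distinct strings
skewed_lgl     <- sample(c(TRUE, FALSE), n, replace = TRUE,            # 12) 90/10 logicals
                         prob = c(0.9, 0.1))
```

Generators like these make the benchmark reproducible: each dataset's characteristics (randomness, range, NA density, cardinality) are explicit in the code that creates it.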
and many more. Performance varies a lot between all of these types. When measuring the compression and serialization performance of particular software, the first question that should be answered is: 'what data are you actually compressing / serializing?'. Standard text-oriented benchmark datasets like the Silesia compression corpus are not very relevant to data science and would not accurately depict the performance of packages like fst, feather, or data.table (fread / fwrite).
I will definitely use fst to construct the benchmark suite that I am building for Julia!
Hi @xiaodaigh, it would be great to have an fst port for Julia (and Python). The core of fst is now C++ only, so it should be straightforward to write a wrapper for other platforms. For example, I'm using a pure C++ wrapper around fst for testing purposes.
But at the moment I'm concentrating on getting the fst file format stable and ready for future expansions (like data hashes, key tables, and row- and column-binding), so I won't be spending time on ports to other languages just yet.
Your benchmark suite looks very interesting; it would be great to have fst compared with other (cross-language) packages. If you need any help with that, please let me know!
And add a section on benchmarking to https://fstpackage.github.io