fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Consider FSST for faster, smaller string compression #235

Open phillc73 opened 4 years ago

phillc73 commented 4 years ago

Came across the FSST repository today and wondered if this approach might be an improvement for string compression in fst, as long as the column type was known or assumed to be strings.

The README in that repository says their approach has some advantages for string compression compared to a block-based LZ4 compression approach.

MarcusKlik commented 4 years ago

Hi @phillc73,

thanks for the pointers!

From what I read in the fsst publication, it looks like fsst has a smart and promising approach, especially tuned to compression of (short) strings, and it shows solid compression performance. That makes it very suitable for character column (de-)compression. And the licensing seems suitable for use in fst.

In a nutshell, it uses a scheme similar to R factors, but with levels corresponding to sub-strings. The lookup table is extremely small (only 256 levels); it does surprise me that you can get these compression ratios with a lookup table that small. And I think some testing is needed to see whether the scheme works equally well for UTF-8 strings and for vectors with a large percentage of unique values (where occurrences of sub-strings from the lookup table are very low).
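
Just to illustrate the idea (a toy sketch only, not the actual fsst code or its API): each string is rewritten as a sequence of 1-byte codes pointing into a table of at most 256 sub-string "symbols", with an escape code for bytes the table does not cover. The symbol table below is hard-coded for the example; the real fsst builds it adaptively from a sample of the data.

```cpp
// Toy illustration of symbol-table ("factor-like") sub-string compression.
// Not fsst itself: the table here is fixed and matching is a naive greedy scan.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct SymbolTable {
    std::vector<std::string> symbols;  // at most 255 entries, code 255 reserved as escape
};

// Greedily encode 'in' by longest symbol match; unmatched bytes are escaped.
std::vector<uint8_t> encode(const std::string& in, const SymbolTable& table) {
    std::vector<uint8_t> out;
    size_t pos = 0;
    while (pos < in.size()) {
        int best = -1;
        size_t bestLen = 0;
        for (size_t s = 0; s < table.symbols.size(); ++s) {
            const std::string& sym = table.symbols[s];
            if (sym.size() > bestLen && in.compare(pos, sym.size(), sym) == 0) {
                best = static_cast<int>(s);
                bestLen = sym.size();
            }
        }
        if (best >= 0) {
            out.push_back(static_cast<uint8_t>(best));    // one byte replaces a whole sub-string
            pos += bestLen;
        } else {
            out.push_back(255);                           // escape marker
            out.push_back(static_cast<uint8_t>(in[pos])); // literal byte
            ++pos;
        }
    }
    return out;
}

std::string decode(const std::vector<uint8_t>& in, const SymbolTable& table) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 255) {
            out.push_back(static_cast<char>(in[++i]));    // literal byte follows the escape
        } else {
            out += table.symbols[in[i]];                  // expand the symbol
        }
    }
    return out;
}

int main() {
    // Hypothetical symbol table; fsst would derive one from the actual column data.
    SymbolTable table{{"http://", "www.", ".com", ".org", "fstpackage"}};
    std::string s = "http://www.fstpackage.org/";
    std::vector<uint8_t> packed = encode(s, table);
    std::cout << s.size() << " bytes -> " << packed.size() << " bytes\n";
    std::cout << "round trip ok: " << (decode(packed, table) == s) << "\n";
}
```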

To make it work in parallel, we will still need to process in blocks (:-)), but because each block can have its own lookup table, that might actually improve the compression ratio. Within blocks, decompression can also be faster for partial reads (due to the full random access that fsst-compressed data allows).
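
A rough sketch of what that block layout could look like (again only an assumption about a possible design, not how fst or fsst actually lay out data): each block would carry its own symbol table plus compressed payload, and a small index of block offsets lets a partial read touch only the blocks that overlap the requested rows.

```cpp
// Sketch of block-wise storage with per-block symbol tables: a partial read of
// rows [firstRow, firstRow + n) only needs the blocks overlapping that range.
#include <cstdint>
#include <iostream>
#include <vector>

struct BlockIndexEntry {
    uint64_t fileOffset;   // where the block (symbol table + payload) starts in the file
    uint64_t firstRow;     // first character-vector element stored in this block
    uint64_t rowCount;     // number of elements in this block
};

// Return the indices of the blocks needed to serve rows [firstRow, firstRow + n).
std::vector<size_t> blocksForRange(const std::vector<BlockIndexEntry>& index,
                                   uint64_t firstRow, uint64_t n) {
    std::vector<size_t> needed;
    const uint64_t last = firstRow + n;  // exclusive
    for (size_t b = 0; b < index.size(); ++b) {
        const uint64_t blockFirst = index[b].firstRow;
        const uint64_t blockLast = blockFirst + index[b].rowCount;  // exclusive
        if (blockFirst < last && firstRow < blockLast) {
            needed.push_back(b);  // ranges overlap, this block must be read
        }
    }
    return needed;
}

int main() {
    // Hypothetical column of 10000 strings stored in blocks of 4096 elements.
    std::vector<BlockIndexEntry> index = {
        {0,      0,    4096},
        {70000,  4096, 4096},
        {140000, 8192, 1808},
    };
    // A partial read of rows [4000, 4200) only needs blocks 0 and 1; the last
    // block (and its symbol table) is never touched or decompressed.
    for (size_t b : blocksForRange(index, 4000, 200)) {
        std::cout << "read block " << b << " at offset " << index[b].fileOffset << "\n";
    }
}
```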

Interesting, thanks!