fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Consider FSST for faster, smaller string compression #235

Open phillc73 opened 4 years ago

phillc73 commented 4 years ago

Came across the FSST repository today and wondered if this approach might be an improvement for string compression in fst, as long as the column type was known or assumed to be strings.

The README in that repository says their approach has some advantages for string compression compared to a block-based LZ4 compression approach.

MarcusKlik commented 4 years ago

Hi @phillc73,

thanks for the pointers!

From what I read in the fsst publication, it looks like fsst has a smart and promising approach, especially tuned to compression of (short) strings, and it shows solid compression performance. That makes it very suitable for character column (de-)compression. And the licensing seems suitable for use in fst.

In a nutshell, it uses a scheme similar to R factors, but with levels corresponding to sub-strings. The lookup table is extremely small (only 256 levels); it does surprise me that you can get these compression ratios with a lookup table that small. And I think some testing is needed to see whether the scheme works equally well for UTF-8 strings and for vectors with a large percentage of unique values (where occurrences of sub-strings from the lookup table are very low).
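
Just to illustrate the idea (a toy sketch only, not the actual fsst code or its API): each string is rewritten as a sequence of 1-byte codes pointing into a table of at most 256 sub-string "symbols", with an escape code for bytes the table does not cover. The symbol table below is hard-coded for the example; the real fsst builds it adaptively from a sample of the data.

```cpp
// Toy illustration of symbol-table ("factor-like") sub-string compression.
// Not fsst itself: the table here is fixed and matching is a naive greedy scan.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct SymbolTable {
    std::vector<std::string> symbols;  // at most 255 entries, code 255 reserved as escape
};

// Greedily encode 'in' by longest symbol match; unmatched bytes are escaped.
std::vector<uint8_t> encode(const std::string& in, const SymbolTable& table) {
    std::vector<uint8_t> out;
    size_t pos = 0;
    while (pos < in.size()) {
        int best = -1;
        size_t bestLen = 0;
        for (size_t s = 0; s < table.symbols.size(); ++s) {
            const std::string& sym = table.symbols[s];
            if (sym.size() > bestLen && in.compare(pos, sym.size(), sym) == 0) {
                best = static_cast<int>(s);
                bestLen = sym.size();
            }
        }
        if (best >= 0) {
            out.push_back(static_cast<uint8_t>(best));    // one byte replaces a whole sub-string
            pos += bestLen;
        } else {
            out.push_back(255);                           // escape marker
            out.push_back(static_cast<uint8_t>(in[pos])); // literal byte
            ++pos;
        }
    }
    return out;
}

std::string decode(const std::vector<uint8_t>& in, const SymbolTable& table) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 255) {
            out.push_back(static_cast<char>(in[++i]));    // literal byte follows the escape
        } else {
            out += table.symbols[in[i]];                  // expand the symbol
        }
    }
    return out;
}

int main() {
    // Hypothetical symbol table; fsst would derive one from the actual column data.
    SymbolTable table{{"http://", "www.", ".com", ".org", "fstpackage"}};
    std::string s = "http://www.fstpackage.org/";
    std::vector<uint8_t> packed = encode(s, table);
    std::cout << s.size() << " bytes -> " << packed.size() << " bytes\n";
    std::cout << "round trip ok: " << (decode(packed, table) == s) << "\n";
}
```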

To make it work in parallel, we will still need to process in blocks (:-)), but because each block can have its own lookup table, that might actually improve the compression ratio. Within blocks, decompression can also be faster for partial reads (due to the full random access that fsst-compressed data allows).
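
A rough sketch of what that block layout could look like (again only an assumption about a possible design, not how fst or fsst actually lay out data): each block would carry its own symbol table plus compressed payload, and a small index of block offsets lets a partial read touch only the blocks that overlap the requested rows.

```cpp
// Sketch of block-wise storage with per-block symbol tables: a partial read of
// rows [firstRow, firstRow + n) only needs the blocks overlapping that range.
#include <cstdint>
#include <iostream>
#include <vector>

struct BlockIndexEntry {
    uint64_t fileOffset;   // where the block (symbol table + payload) starts in the file
    uint64_t firstRow;     // first character-vector element stored in this block
    uint64_t rowCount;     // number of elements in this block
};

// Return the indices of the blocks needed to serve rows [firstRow, firstRow + n).
std::vector<size_t> blocksForRange(const std::vector<BlockIndexEntry>& index,
                                   uint64_t firstRow, uint64_t n) {
    std::vector<size_t> needed;
    const uint64_t last = firstRow + n;  // exclusive
    for (size_t b = 0; b < index.size(); ++b) {
        const uint64_t blockFirst = index[b].firstRow;
        const uint64_t blockLast = blockFirst + index[b].rowCount;  // exclusive
        if (blockFirst < last && firstRow < blockLast) {
            needed.push_back(b);  // ranges overlap, this block must be read
        }
    }
    return needed;
}

int main() {
    // Hypothetical column of 10000 strings stored in blocks of 4096 elements.
    std::vector<BlockIndexEntry> index = {
        {0,      0,    4096},
        {70000,  4096, 4096},
        {140000, 8192, 1808},
    };
    // A partial read of rows [4000, 4200) only needs blocks 0 and 1; the last
    // block (and its symbol table) is never touched or decompressed.
    for (size_t b : blocksForRange(index, 4000, 200)) {
        std::cout << "read block " << b << " at offset " << index[b].fileOffset << "\n";
    }
}
```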

Interesting, thanks!