Open phillc73 opened 4 years ago
Hi @phillc73,
thanks for the pointers!
From what I read in the fsst publication, it looks like fsst has a smart and promising approach, especially tuned to (short) string compression, and it shows solid compression performance. That makes it very suitable for character column (de-)compression. And the licensing seems suitable for use in fst.
In a nutshell, it uses a scheme similar to R factors, but with levels corresponding to sub-strings. The lookup table is extremely small (only 256 levels); it does surprise me that you can get these compression ratios with a lookup table that small. And I think some testing is needed to see whether the scheme works equally well for UTF-8 strings and for vectors with a large percentage of unique values (where occurrences of substrings from the lookup table are very low).
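The substring lookup-table idea can be sketched in toy form. To be clear, this is not the actual FSST algorithm: the greedy table construction and the longest-savings matching below are simplified stand-ins, just to illustrate a small symbol table (here up to 255 codes) plus an escape code for literal bytes:

```python
from collections import Counter

def build_table(strings, max_symbols=255, max_len=8):
    """Pick frequent substrings for the symbol table (a hypothetical
    greedy heuristic; the real FSST construction is iterative and
    much more careful)."""
    counts = Counter()
    for s in strings:
        b = s.encode("utf-8")
        for i in range(len(b)):
            for l in range(2, max_len + 1):
                if i + l <= len(b):
                    counts[b[i:i + l]] += 1
    # rank substrings by bytes saved: (length - 1) * frequency
    ranked = sorted(counts, key=lambda sub: (len(sub) - 1) * counts[sub],
                    reverse=True)
    return ranked[:max_symbols]

def encode(s, table):
    b = s.encode("utf-8")
    out, i = bytearray(), 0
    while i < len(b):
        for code, sub in enumerate(table):
            if b.startswith(sub, i):
                out.append(code)       # one byte replaces the substring
                i += len(sub)
                break
        else:
            out += bytes([255, b[i]])  # escape code + literal byte
            i += 1
    return bytes(out)

def decode(enc, table):
    out, i = bytearray(), 0
    while i < len(enc):
        if enc[i] == 255:
            out.append(enc[i + 1]); i += 2
        else:
            out += table[enc[i]]; i += 1
    return out.decode("utf-8")
```

Even this crude version shows the trade-off mentioned above: strings sharing substrings with the training data compress well, while mostly-unique strings fall back to escaped literals and barely shrink at all.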
To make it work in parallel, we will still need to process in blocks (:-)), but because each block can have its own lookup table, that might actually improve the compression ratio. Within blocks, decompression can be faster for partial reads (due to the full random access of fsst compression reads).
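A rough sketch of why per-block tables allow random access within a block: if each string's compressed bytes are independently decodable given the block's table, an offset index is enough to read one row without touching the rest. The layout and the tiny pair-substitution codec below are assumptions for illustration only, not fst's or fsst's actual format:

```python
# Toy per-block symbol table: codes 0..4 map to two-byte substrings,
# code 255 escapes a literal byte (an assumed stand-in codec).
PAIRS = [b"th", b"he", b"in", b"er", b"an"]

def compress(s):
    b, out, i = s.encode("utf-8"), bytearray(), 0
    while i < len(b):
        for code, p in enumerate(PAIRS):
            if b.startswith(p, i):
                out.append(code); i += 2
                break
        else:
            out += bytes([255, b[i]]); i += 1
    return bytes(out)

def decompress(enc):
    out, i = bytearray(), 0
    while i < len(enc):
        if enc[i] == 255:
            out.append(enc[i + 1]); i += 2
        else:
            out += PAIRS[enc[i]]; i += 1
    return out.decode("utf-8")

def pack_block(strings):
    """Compress each string separately and record (offset, length)."""
    payload, offsets, pos = bytearray(), [], 0
    for s in strings:
        enc = compress(s)
        offsets.append((pos, len(enc)))
        payload += enc
        pos += len(enc)
    return offsets, bytes(payload)

def read_one(offsets, payload, i):
    """Random access: decode only string i. With a block-wide codec
    like LZ4 you would have to decompress the block prefix first."""
    start, length = offsets[i]
    return decompress(payload[start:start + length])
```

Blocks also keep the parallel story simple: each block carries its own table, so worker threads can compress or decompress blocks independently.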
Interesting, thanks!
Came across the FSST repository today and wondered if this approach might be an improvement for string compression in fst, as long as the column type was known or assumed to be strings. The README in that repository says their approach has some advantages for string compression when compared to the block-based LZ4 compression approach.