FSE performance optimizations

Cyan4973 / FiniteStateEntropy

New generation entropy codecs : Finite State Entropy and Huff0

BSD 2-Clause "Simplified" License

1.34k stars 144 forks source link

FSE performance optimizations #112

Open pingladd opened 1 year ago

pingladd commented 1 year ago

Hello, I am working on the FSE performance optimizations and trying to implement 8-states instead of the default 2-states, which might accelerate the decompressing by processing the 8-states in parallel. To process the 8-states, I changed the bitContainer to a __m128i type and modified all the related functions. Some files were compressed and decompressed correctly for the tests, but some were corrupted because of minor differences between the decoded and the original files. I tried to find the bug but couldn't make it. I would like to know if it is possible to have someone check my code or discuss it with me? Thank you!

Cyan4973 commented 1 year ago

I would recommend starting with 4-states, using a 64-bit container. This would be much more straightforward and easier to debug.

After such a success, it would be a smaller step to stretch that implementation to 8-states using a 128-bit container.

Note that __m128i is unlikely to be a native type. This comes with some rather complex consequences when operating on such a type, meaning that operations like + or << are no longer as simple as they look, with corresponding impact on performance. As a consequence, it's not obvious if such a move would improve performance.

MarcusJohnson91 commented 1 year ago

Use _BitInt, it’s part of the C23 spec which should be out later this year instead of weird compiler specific pseudo builtin types.

pingladd commented 1 year ago

Thank you all for the reply. I have already implemented the 4-states, and it passed all the tests when running the fuzzer.exe in the FSE repo. I tested some files but did not see any performance uplift. Also, I get errors when running fuzzer.exe from zstd library (some of them are 1-byte compressed size differences).

Yes, for the 8-states, I have implemented the 128-bit shift operations and it seems to add lots of complexity. For the "+" operation, I just used the _mm_or_si128 instruction, maybe there has something wrong with it. (I will check these operations again)

The 8-states probably will not improve the performance, but it just bugs me that I cannot make it work. Thank you!