101arrowz / fflate

High performance (de)compression in an 8kB package
https://101arrowz.github.io/fflate
MIT License

Optimize inner loop in inflt #180

Closed karyon closed 1 year ago

karyon commented 1 year ago

This change made the processing pipeline in our application, which is currently dominated by inflt, about 3% faster, so the effect on inflt itself is probably somewhat larger than that.

Note that I don't know much about deflate, compression algorithms in general, or low-level JS optimization. To me, this looked like a loop unroll that V8 couldn't take advantage of, maybe because neither source nor target is 4-byte-aligned, or maybe it's just not smart enough. The unrolled loop would write up to 3 bytes past end. I assumed this was just an artifact of the unrolling, and harmless, since the bytes written in excess are overwritten in the next outer loop iteration.
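For readers unfamiliar with the pattern under discussion, here is a hypothetical sketch (not fflate's actual code) of the two variants of an LZ77 back-reference copy: a semi-unrolled loop that may overshoot by up to 3 bytes, and the plain byte loop. The names dt, dist, and end are illustrative.

```javascript
// Semi-unrolled variant: copies 4 bytes per iteration and may write up to
// 3 bytes past `end`; the excess bytes are overwritten by later output.
function copyMatchUnrolled(buf, dt, dist, end) {
  let source = dt - dist;
  while (dt < end) {
    buf[dt++] = buf[source++];
    buf[dt++] = buf[source++];
    buf[dt++] = buf[source++];
    buf[dt++] = buf[source++];
  }
  return end; // the logical write position is `end`, despite the overshoot
}

// Plain one-byte-at-a-time variant. Because bytes are copied sequentially,
// overlapping matches (dist < length) repeat the pattern correctly.
function copyMatchSimple(buf, dt, dist, end) {
  for (let source = dt - dist; dt < end; ) {
    buf[dt++] = buf[source++];
  }
  return dt;
}
```

Both produce the same bytes in the range [dt, end); the difference is purely how the engine optimizes them.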

I also tried a fast path using TypedArray.copyWithin when end - dt <= bt, hoping it would compile down to a simple memmove, but it was slower.
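To illustrate why such a fast path needs a guard at all: copyWithin has memmove semantics, which match the byte loop only when the match distance is at least the match length (i.e. source and destination don't overlap). This is a hypothetical sketch of the idea, not the code that was benchmarked.

```javascript
function copyMatch(buf, dt, dist, end) {
  const len = end - dt;
  if (dist >= len) {
    // Non-overlapping back-reference: one bulk move gives the same result.
    buf.copyWithin(dt, dt - dist, end - dist);
    return end;
  }
  // Overlapping match (dist < len): the pattern must repeat as it is copied,
  // which memmove semantics would not do, so fall back to the byte loop.
  for (let source = dt - dist; dt < end; ) {
    buf[dt++] = buf[source++];
  }
  return dt;
}
```

Even when the guard holds, the call overhead of copyWithin can outweigh the bulk copy for the short matches typical of deflate streams, which is consistent with the slowdown reported above.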

I haven't run the tests yet (though it looks to me like there are no functional ones?). I also didn't find any proper benchmarks; I'm happy to run some if there's a benchmark suite that's easy to set up.

101arrowz commented 1 year ago

You can try yarn test and you should get benchmark results in the test/results folder. You can compare before and after to see if it improves performance. See this description for more info.

I'm surprised that this is improving performance for you - last time I tried removing the semi-unrolled loop it actually worsened performance in the common case. I'll have to check this on some of my own test files as well.

karyon commented 1 year ago

I modified the benchmarks so that one run would decompress everything 10 times and take the average, and also removed everything else from the tests here.
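The averaging approach described above can be sketched as a small timing helper. The decompress argument stands in for whatever the benchmark exercises (e.g. fflate's inflate path); none of the names here are fflate's actual harness.

```javascript
// Time one function over several repetitions and report the mean duration
// in milliseconds, to smooth out per-run noise.
function benchAverage(decompress, input, runs = 10) {
  const times = [];
  for (let i = 0; i < runs; ++i) {
    const start = performance.now();
    decompress(input);
    times.push(performance.now() - start);
  }
  return times.reduce((a, b) => a + b, 0) / times.length;
}
```

Taking the best of several such averaged runs, as done below, further reduces the influence of one-off scheduler or GC hiccups.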

So in Node 18 there was virtually no difference: the best run with the new code was 0.15% faster than with the old code, and the averages were lost in the noise.

In Node 20, however (which, by the way, was overall around 10% faster at decompression than Node 18), each run was consistently faster than with the old code. Comparing the best runs, the new code was around 1.3% faster.

Here's a more detailed comparison of the best-of-five runs: [image]

By the way, I had to make some fixes to get the benchmarks running on Windows. Even then build:demo would still fail, but the benchmarks didn't need it.

Is there an easy way to run this in Firefox?

101arrowz commented 1 year ago

Unfortunately, Windows compatibility for development probably won't improve, as Parcel still doesn't work perfectly on Windows as far as I know.

The benchmark results seem reasonable. I'll use this for the next release, thanks for the PR!