Frommi / miniz_oxide

Rust replacement for miniz
MIT License
168 stars 48 forks source link

optimize `inflate::core::transfer` #131

Closed connorskees closed 1 year ago

connorskees commented 1 year ago

Profiling the decoding of a sample image using image-rs/png showed that >1/3rd of the time decoding was spent in inflate::core::transfer. This change special cases the scenario in which we do no masking to significantly optimize this function for real workloads. On the image that inspired this optimization work, I see a 55% speedup locally.

Relevant benchmarks:

// before
test oxide::decompress_compressed_lvl_1 ... bench:      89,253 ns/iter (+/- 1,412)
test oxide::decompress_compressed_lvl_6 ... bench:     176,299 ns/iter (+/- 6,266)
test oxide::decompress_compressed_lvl_9 ... bench:     175,840 ns/iter (+/- 3,131)
// after
test oxide::decompress_compressed_lvl_1 ... bench:      83,992 ns/iter (+/- 1,347)
test oxide::decompress_compressed_lvl_6 ... bench:      76,640 ns/iter (+/- 2,288)
test oxide::decompress_compressed_lvl_9 ... bench:      76,791 ns/iter (+/- 1,978)

Omitted benchmarks do not execute this codepath. For oxide::decompress_compressed_lvl_6 and oxide::decompress_compressed_lvl_9 we see incredible improvements of ~2.3x.

Potential future work

It should be possible to speed up the case in which source_pos.abs_diff(out_pos) == 3, but I wasn't able to find an easy solution that doesn't use unsafe or SIMD intrinsics. The 3 diff case accounts for about 50% of all cases on some benchmarks that do not see a performance increase from this change.

It should also be possible to do much better on the general case when source_pos.abs_diff(out_pos) >= 4 using either clever hacks to make the compiler do what we want or SIMD intrinsics. I have not profiled how much of an impact a more efficient implementation of this case would have.

oyvindln commented 1 year ago

Really nice.

Yeah lack of safe SIMD is an issue I guess. Often the compiler is smart enough to use it but it varies. rustc and underlying llvm have some options to help debug why it didn't vectorize stuff but it's clunky.

There's probably a fair bit of room for improvement on the encoding side too.