optimize `inflate::core::transfer`

Profiling the decoding of a sample image using image-rs/png showed that >1/3rd of the time decoding was spent in inflate::core::transfer. This change special cases the scenario in which we do no masking to significantly optimize this function for real workloads. On the image that inspired this optimization work, I see a 55% speedup locally.

Relevant benchmarks:

// before
test oxide::decompress_compressed_lvl_1 ... bench:      89,253 ns/iter (+/- 1,412)
test oxide::decompress_compressed_lvl_6 ... bench:     176,299 ns/iter (+/- 6,266)
test oxide::decompress_compressed_lvl_9 ... bench:     175,840 ns/iter (+/- 3,131)

// after
test oxide::decompress_compressed_lvl_1 ... bench:      83,992 ns/iter (+/- 1,347)
test oxide::decompress_compressed_lvl_6 ... bench:      76,640 ns/iter (+/- 2,288)
test oxide::decompress_compressed_lvl_9 ... bench:      76,791 ns/iter (+/- 1,978)

Omitted benchmarks do not execute this codepath. For oxide::decompress_compressed_lvl_6 and oxide::decompress_compressed_lvl_9 we see incredible improvements of ~2.3x.

Potential future work

It should be possible to speed up the case in which source_pos.abs_diff(out_pos) == 3, but I wasn't able to find an easy solution that doesn't use unsafe or SIMD intrinsics. The 3 diff case accounts for about 50% of all cases on some benchmarks that do not see a performance increase from this change.

It should also be possible to do much better on the general case when source_pos.abs_diff(out_pos) >= 4 using either clever hacks to make the compiler do what we want or SIMD intrinsics. I have not profiled how much of an impact a more efficient implementation of this case would have.

Frommi / miniz_oxide

optimize `inflate::core::transfer` #131

Potential future work