Closed: connorskees closed this pull request 1 year ago
Really nice.
Yeah, the lack of safe SIMD is an issue, I guess. Often the compiler is smart enough to autovectorize, but it varies. rustc and the underlying LLVM have some options to help debug why something didn't vectorize, but they're clunky.
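For what it's worth, one pattern that often decides whether the autovectorizer kicks in is whether the loop body is free of bounds checks; iterating over zipped slices instead of indexing is a common way to get there. A minimal sketch (names and shapes are mine, not code from this PR):

```rust
/// Indexed form: the `a[i]` / `b[i]` bounds checks can block vectorization
/// unless the compiler can prove every index is in range.
fn add_indexed(a: &[u8], b: &[u8], out: &mut [u8]) {
    for i in 0..out.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}

/// Iterator form: `zip` encodes the length relationship directly, so there
/// are no bounds checks inside the loop and LLVM tends to vectorize it.
fn add_zipped(a: &[u8], b: &[u8], out: &mut [u8]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x.wrapping_add(y);
    }
}
```

If I remember right, the "clunky" debugging options alluded to above are LLVM's optimization remarks, reachable through rustc with something like `-C llvm-args=-pass-remarks-missed=loop-vectorize`.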
There's probably a fair bit of room for improvement on the encoding side too.
Profiling the decoding of a sample image using `image-rs/png` showed that more than a third of decoding time was spent in `inflate::core::transfer`. This change special-cases the scenario in which we do no masking to significantly optimize this function for real workloads. On the image that inspired this optimization work, I see a 55% speedup locally.

Relevant benchmarks:
Omitted benchmarks do not execute this codepath. For `oxide::decompress_compressed_lvl_6` and `oxide::decompress_compressed_lvl_9` we see incredible improvements of ~2.3x.

Potential future work
It should be possible to speed up the case in which `source_pos.abs_diff(out_pos) == 3`, but I wasn't able to find an easy solution that doesn't use unsafe or SIMD intrinsics. The diff-of-3 case accounts for about 50% of all cases on some benchmarks that do not see a performance increase from this change.

It should also be possible to do much better in the general case where `source_pos.abs_diff(out_pos) >= 4`, using either clever hacks to make the compiler do what we want or SIMD intrinsics. I have not profiled how much of an impact a more efficient implementation of this case would have.
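As a rough illustration of what the `>= 4` case could look like in safe Rust (this is my own sketch, not code from this PR, and the function name and signature are hypothetical): when the distance between source and destination is at least 4, a 4-byte block read from the source can never overlap the 4-byte block being written, so the copy can proceed in fixed-size chunks that the optimizer can lower to word-sized loads and stores. A distance of exactly 3 is the awkward one, since a 4-byte read there would straddle the write cursor.

```rust
/// Hypothetical sketch: copy `len` bytes from `source_pos` forward to
/// `out_pos` within the same buffer (LZ77 back-reference semantics),
/// assuming the caller routes distances below 4 to a different path.
fn copy_match_wide(out: &mut [u8], source_pos: usize, out_pos: usize, len: usize) -> usize {
    let dist = out_pos - source_pos;
    assert!(dist >= 4, "distances below 4 need a byte-at-a-time path");
    let mut i = 0;
    // Main loop: 4-byte blocks. Because dist >= 4, the block read at
    // `source_pos + i` never overlaps the block written at `out_pos + i`,
    // so each round is a plain non-aliasing copy. Overlap across rounds
    // (dist < len) still works because rounds run in order: later reads
    // see bytes written by earlier rounds.
    while i + 4 <= len {
        let mut chunk = [0u8; 4];
        chunk.copy_from_slice(&out[source_pos + i..source_pos + i + 4]);
        out[out_pos + i..out_pos + i + 4].copy_from_slice(&chunk);
        i += 4;
    }
    // Tail: up to 3 remaining bytes, copied one at a time.
    while i < len {
        out[out_pos + i] = out[source_pos + i];
        i += 1;
    }
    out_pos + len
}
```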