Frommi / miniz_oxide

Rust replacement for miniz
MIT License
168 stars 48 forks source link

optimize `inflate::core::init_tree` by precomputing reversed bits #132

Closed connorskees closed 1 year ago

connorskees commented 1 year ago

Somewhat inspired by #82. For small inputs, this function can take up more than 50% of the runtime. As noted in the linked issue, in absolute terms this function is not notable, accounting for at most a couple hundred microseconds, but it is still useful to optimize.

There is only one relevant benchmark in the existing suite:

// before
test oxide::decompress_short_lvl_1      ... bench:       2,966 ns/iter (+/- 67)
// after
test oxide::decompress_short_lvl_1      ... bench:       2,377 ns/iter (+/- 88)

A roughly 25% improvement (though in absolute terms only 500ns). On the image linked in #82, the total runtime improves by about 13%. This benchmark includes the total time to decode the PNG. In absolute terms the performance increase is about 0.4ms on the server I use for benchmarking.

I'm a bit skeptical about the precomputed reversed bits table, and have left a GitHub comment below discussing this.

This PR in its current form should close #82.

Potential future work

There likely isn't much benefit to spending more time on this function. If one really wanted to squeeze more performance out of this, doubling the size of the precomputed reversed bits lookup table should have some impact, though I haven't measured the performance of making such a change. The while rev_code < FAST_LOOKUP_SIZE loop is extremely tight, but I haven't looked at any potential savings there. The for i in 1..16 loop looks extremely similar to prefix-sum which can be optimized with SIMD, but such an optimization would likely require unsafe use of intrinsics.

oyvindln commented 1 year ago

@notgull does it look fine to you as well now?

notgull commented 1 year ago

Yeah, it looks good to me. This is basically how I'd do it.

oyvindln commented 1 year ago

aight. let me know when you want a new release. Can wait a bit if you have more work on the way.