DanShaders opened 2 weeks ago
@timschumi, here are performance numbers on an i9-13900H (which were physically quite painful to get, because my laptop doesn't provide enough cooling for even a single core to run at max turbo frequency for a minute):
| testcase | time |
|---|---|
| inflate_async_1gb (factor = 2.0) | 3655.9±49.9ms (min=3570ms, max=3719ms, total=36559ms) |
| inflate_async_1gb (factor = 1.0) | 3755.1±48.1ms (min=3675ms, max=3814ms, total=37551ms) |
| inflate_sync_1gb | 6236.7±55.3ms (min=6124ms, max=6294ms, total=62367ms) |
I cannot immediately tell what's wrong with your argument against `optimization_factor`, but empirically it seems to give a ~3% boost from a two-line diff.
The file I used for benchmarking is an old 1 GB wiki dump: https://www.dshpr.com/wiki-dump-1g. The benchmark itself is
This PR builds upon the previous work in the foundational coroutines PR and implements streamable asynchronous, error-safe, and EOF-correct decompression. Incidentally, the new asynchronous implementation is about twice as fast as our previous synchronous one.
In the future, to address the code duplication this PR introduces, I plan to create an `AsyncStream` -> `Stream` translation mechanism and use it to reroute the old classes to the new implementation.
(The Deflate algorithm itself is pretty much copied directly from the old implementation, so it probably doesn't require as much attention as the scaffolding does.)
@timschumi, I'm sorry for yet another gigantic PR :).