gendx / lzma-rs

An LZMA decoder written in pure Rust
MIT License
128 stars 27 forks source link

Use const-generics #19

Closed gendx closed 1 year ago

gendx commented 4 years ago

For now, the DecoderState and BitTree structs use memory allocations (in Vec) to store their state, despite having sizes known at compile time. This is in the most part to avoid code duplication.

Once const generics are stable enough, they should be used instead to remove the need for dynamic allocation, and make the whole state struct more compact and contiguous (therefore potentially more cache-friendly).

This should hopefully improve performance a bit.

npmccallum commented 3 years ago

Const generics are now stable in 1.51.

gendx commented 3 years ago

I had a brief look, the min_const_generics feature was indeed stabilized in 1.51, but we still need the more complex const_evaluatable_checked feature due to the 1 << num_bits expression here.

In the meantime, it might be worth experimenting with it on nightly to see if benchmarks would improve with const generics.

chyyran commented 2 years ago

I've been doing some experiments in https://github.com/chyyran/lzma-rs/commits/feature-perf-experiments and found that const generics in BitTree and RangeCoder doesn't actually contribute that much performance benefit, except for large dictionaries.

Without const generics (but with some additional optimizations) https://github.com/chyyran/lzma-rs/commit/ecc71a07edb0bb7942e518b4fc9363a9bccab8f7

running 8 tests
test compress_65536                  ... bench:   1,372,700 ns/iter (+/- 32,501)
test compress_empty                  ... bench:         713 ns/iter (+/- 35)
test compress_hello                  ... bench:       1,021 ns/iter (+/- 50)
test decompress_after_compress_65536 ... bench:   2,463,569 ns/iter (+/- 77,148)
test decompress_after_compress_empty ... bench:       1,720 ns/iter (+/- 146)
test decompress_after_compress_hello ... bench:       2,219 ns/iter (+/- 50)
test decompress_big_file             ... bench:   3,959,745 ns/iter (+/- 3,155,824)
test decompress_huge_dict            ... bench:       2,221 ns/iter (+/- 136)

With const generics https://github.com/chyyran/lzma-rs/commit/ad7d096286c7fc4324ac42716e58d89077e5c63b

running 8 tests
test compress_65536                  ... bench:   1,379,341 ns/iter (+/- 160,163)
test compress_empty                  ... bench:         711 ns/iter (+/- 57)
test compress_hello                  ... bench:       1,016 ns/iter (+/- 30)
test decompress_after_compress_65536 ... bench:   2,442,160 ns/iter (+/- 125,797)
test decompress_after_compress_empty ... bench:         544 ns/iter (+/- 110)
test decompress_after_compress_hello ... bench:       1,003 ns/iter (+/- 47)
test decompress_big_file             ... bench:   3,939,209 ns/iter (+/- 177,650)
test decompress_huge_dict            ... bench:       1,065 ns/iter (+/- 40)

Probably worth pursuing in the future because of that benefit but https://github.com/chyyran/lzma-rs/commit/ad7d096286c7fc4324ac42716e58d89077e5c63b uses kind of a nasty workaround to get around the lack of #[feature(generic_const_exprs)] by putting 1 << N into a const fn and passing it in at the constructor site.

chyyran commented 2 years ago

Vec2D from #77 lets us easily experiment with different backing stores for the literal_probs array which is the main allocation. I used arrayvec which gives us a nice API over a raw array.

Since DecoderState needs to be able to update lclppb at runtime, we can't just parameterize it with const generics unlike BitTree, so we have to account for the theoretical maximum parameters which is sizeof(u16) * 0x300 * { 1 << (4 + 8) } = 3145728 * 2. An array of 6 megabytes is way too large to fit on the stack and I was unable to actually decompress anything without a stack overflow.

For LZMA2 streams, the theoretical maximum parameters for the literal_probs array is sizeof(u16) * 0x300 * { 1 << 4 } = 12288 * 2. While this isn't great, 24KB is way smaller than 6MB so I was able to actually run some benchmarks, which are not great.

These benchmarks are with an allocating BitTree, i.e. without const-generics.

ArrayVec<u16, 0x300 * { 1 << 4 }>

running 8 tests
test compress_65536                  ... bench:   1,323,468 ns/iter (+/- 174,348)
test compress_empty                  ... bench:         695 ns/iter (+/- 20)
test compress_hello                  ... bench:       1,035 ns/iter (+/- 90)
test decompress_after_compress_65536 ... bench:   1,336,165 ns/iter (+/- 57,379)
test decompress_after_compress_empty ... bench:       6,166 ns/iter (+/- 359)
test decompress_after_compress_hello ... bench:       6,531 ns/iter (+/- 224)
test decompress_big_file             ... bench:   3,716,763 ns/iter (+/- 121,865)
test decompress_huge_dict            ... bench:       6,572 ns/iter (+/- 315)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 13.64s

Box<[u16]>

running 8 tests
test compress_65536                  ... bench:   1,334,964 ns/iter (+/- 1,371,984)
test compress_empty                  ... bench:         709 ns/iter (+/- 436)
test compress_hello                  ... bench:       1,033 ns/iter (+/- 64)
test decompress_after_compress_65536 ... bench:   1,145,697 ns/iter (+/- 82,369)
test decompress_after_compress_empty ... bench:       1,890 ns/iter (+/- 92)
test decompress_after_compress_hello ... bench:       2,231 ns/iter (+/- 295)
test decompress_big_file             ... bench:   3,620,380 ns/iter (+/- 152,660)
test decompress_huge_dict            ... bench:       2,271 ns/iter (+/- 75)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 7.76s

The huge array on the stack probably causes a lot of cache misses and substantially reduces performance. It seems the only benefit to using const generics here is with BitTree, where it does show an improvement in the benchmarks as above.