dragly commented 4 years ago

It appears that code being equal to range is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests.

Pull Request Overview

This pull request fixes #15

Testing Strategy

This pull request was tested by...

[ ] Added relevant unit tests.
[x] Added relevant end-to-end tests (such as .lzma, .lzma2, .xz files).

Supporting Documentation and References

The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.

TODO or Help Wanted

None

gendx commented 4 years ago

Thanks for your contribution!

One thing I wanted to do but didn't have the time yet was to have tests comparing the implementation to established libraries such as https://crates.io/crates/lzma-sys or https://crates.io/crates/xz2 (backed by a reference C implementation). That is, the decomp_big_file test should also check that we get the same result as the system library.

I don't know how hard it would be to decompress a raw LZMA stream with these crates, as opposed to an xz file. If you manage to implement something like that, it would be very helpful in testing the correctness of such changes!

Otherwise, I'll get back to the details when I have more time. But if you have a link to another LZMA implementation where the code == range case is handled as you describe that would also speed things up :)

dragly commented 4 years ago

Ah, yes. Sorry, I intended to link to the relevant code in XZ Utils, but forgot to do so.

I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

It is used from here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/lzma/lzma_decoder.c#L625

As you can see, there is no test here, but a few additional operations after subtracting the range from the code. I am not sure what they correspond to in lzma_rs.

I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well. We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

gendx commented 4 years ago

I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:

https://github.com/xz-mirror/xz/blob/de1f47b2b40e960b7bc3acba754f66dd19705921/src/liblzma/rangecoder/range_decoder.h#L175

Thanks, this file will be helpful in double-checking how the range coder works in detail :)

I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well.

On this topic, turns out that I already wrote a dumb xz encoder https://github.com/gendx/lzma-rs/blob/master/src/encode/xz.rs, that currently packages an LZMA2 stream which itself encapsulates raw uncompressed data. But it shouldn't be too hard to turn that into a packaging tool that takes an already encoded LZMA stream and packages it as an xz archive, which would allow easier comparison with other libraries.

We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.

By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? sweat_smile There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)

I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

dragly commented 4 years ago

We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.

Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.

That would be great! Performance is in fact very important to us. I have actually been looking into optimizing lzma_rs after profiling a bit. If I recall correctly, it is about 3x-6x slower than xz when running natively and about as fast as lzma-js when running in the browser.

I think what I found was that there was a fair bit of allocation and deallocation going on when we were parsing a large number of files/chunks. I was therefore thinking about restructuring the code so the decoder could be re-used instead of set up and torn down for each file/chunk. It might be that this is not so important if the performance impact is actually coming from the log-statements. I actually did not know that those were not optimized out in a release build.

I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.

Cool! I will check that out!

Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?

Not that I know of. That is partially why I asked, in case you knew of some resourced I had not found ;)

I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)

That would be awesome! I was thinking about doing something similar: To write down what I learned along the way while trying to understand more of LZMA. However, I do not think my understanding is anywhere near a level where I can confidently explain how it all works :)

gendx commented 4 years ago

bors r+

bors[bot] commented 4 years ago

Merge conflict

gendx commented 4 years ago

bors r+

bors[bot] commented 4 years ago

Build succeeded

continuous-integration/travis-ci/push

gendx / lzma-rs

Do not raise error if code equals range in get_bit #16

Pull Request Overview

Testing Strategy

Supporting Documentation and References

TODO or Help Wanted

Merge conflict

Build succeeded