Closed dragly closed 4 years ago
Thanks for your contribution!
One thing I wanted to do but didn't have the time yet was to have tests comparing the implementation to established libraries such as https://crates.io/crates/lzma-sys or https://crates.io/crates/xz2 (backed by a reference C implementation). That is, the decomp_big_file
test should also check that we get the same result as the system library.
I don't know how hard it would be to decompress a raw LZMA stream with these crates, as opposed to an xz file. If you manage to implement something like that, it would be very helpful in testing the correctness of such changes!
Otherwise, I'll get back to the details when I have more time. But if you have a link to another LZMA implementation where the code == range
case is handled as you describe that would also speed things up :)
Ah, yes. Sorry, I intended to link to the relevant code in XZ Utils, but forgot to do so.
I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:
It is used from here:
As you can see, there is no test here, but a few additional operations after subtracting the range from the code. I am not sure what they correspond to in lzma_rs.
I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well. We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.
By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? 😅 There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)
I am not very familiar with LZMA nor XZ Utils from before, but while attempting to compare the two implementations, I concluded that the corresponding counterpart to get_bit should be here:
Thanks, this file will be helpful in double-checking how the range coder works in detail :)
I did try to use the Rust libraries you mentioned above for our purposes (decoding OpenCTM files), but since they did not work directly on lzma streams, they were not suitable. I am not sure if I will be able to test them as well.
On this topic, turns out that I already wrote a dumb xz encoder https://github.com/gendx/lzma-rs/blob/master/src/encode/xz.rs, that currently packages an LZMA2 stream which itself encapsulates raw uncompressed data. But it shouldn't be too hard to turn that into a packaging tool that takes an already encoded LZMA stream and packages it as an xz archive, which would allow easier comparison with other libraries.
We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.
Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.
By the way, what did you base your implementation of lzma_rs on? Was it an existing codebase, a spec, or do you just know it by heart? sweat_smile There appears to be a quite a few resources out there on LZMA, but the details between them appear to vary quite a lot. It was just interesting to see how different your implementation was to that of XZ Utils. (Yours is way more readable, by the way - it took a while to wrap my head around those C macros, ifdefs and intertwined while-loops and switch-statements...)
I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.
Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?
I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)
We are targeting WebAssembly, so a pure Rust implementation is ideal for us anyways.
Good to hear this is useful :) Btw, the current info/debug/tracing statements induce some overhead due to a lot of extra code that is not removed by the compiler (as these can be enabled by an environment variable). If performance is critical to you, let me know as there is room for improvement there, I just didn't have time to clean it up.
That would be great! Performance is in fact very important to us. I have actually been looking into optimizing lzma_rs after profiling a bit. If I recall correctly, it is about 3x-6x slower than xz when running natively and about as fast as lzma-js when running in the browser.
I think what I found was that there was a fair bit of allocation and deallocation going on when we were parsing a large number of files/chunks. I was therefore thinking about restructuring the code so the decoder could be re-used instead of set up and torn down for each file/chunk. It might be that this is not so important if the performance impact is actually coming from the log-statements. I actually did not know that those were not optimized out in a release build.
I had an old project of writing various C++ parsers to learn (and teach?) about file formats: https://github.com/gendx/tyrex. My goal was to have readable (rather than optimized) implementations. The LZMA code was here. I think I originally read a mix of Wikipedia, the SDK and a few other resources about range coding, and tested on a few examples.
Cool! I will check that out!
Then I ported my old C++ code to Rust as I was learning Rust and realized there was no pure-Rust LZMA codec. So I don't know the specs by heart :) Are there even any specs for LZMA other than the code?
Not that I know of. That is partially why I asked, in case you knew of some resourced I had not found ;)
I was also planning on writing on my blog an introduction to how LZMA works. Might end up doing that sooner than later if lzma-rs is getting interest ;)
That would be awesome! I was thinking about doing something similar: To write down what I learned along the way while trying to understand more of LZMA. However, I do not think my understanding is anywhere near a level where I can confidently explain how it all works :)
bors r+
bors r+
It appears that code being equal to range is a valid state and that some files can have data encoded like this. An example of such a file has been added to the tests.
Pull Request Overview
This pull request fixes #15
Testing Strategy
This pull request was tested by...
.lzma
,.lzma2
,.xz
files).Supporting Documentation and References
The original data was produced by processing a randomly generated geometry with OpenCTM. OpenCTM files can contain embedded LZMA compressed data. This appears to be generated with the liblzma library. The data was then extracted to make a standalone file that is attached in this pull request.
TODO or Help Wanted
None