Expose API for raw decoder primitives for LZMA and LZMA2.

chyyran commented 2 years ago

Pull Request Overview

This pull request fixes #72 by refactoring the decoder functions to expose a zero-cost raw decoder API. The existing public API has been refactored to use the raw decoder API internally. The following additions to the public API become available under the decompress::raw module when the raw_decoder feature is enabled.

LzmaDecoder
- LzmaDecoder::new(params: LzmaParams, memlimit: Option<usize>) -> Result<Self>
- LzmaDecoder::reset(&mut self, unpacked_size: Option<Option<u64>>)
- LzmaDecoder::decompress<'a, W: io::Write, R: io::BufRead>(&mut self, input: &mut R, output: &'a mut W)
Lzma2Decoder
- Lzma2Decoder::new() -> Self
- Lzma2Decoder::reset(&mut self)
- Lzma2Decoder::decompress<'a, W: io::Write, R: io::BufRead>(&mut self, input: &mut R, output: &'a mut W)
LzmaProperties
LzmaParams
- LzmaParams::new(properties: LzmaProperties, dict_size: u32, unpacked_size: Option<u64>) -> Self
- LzmaParams::read_header<R>(input: &mut R, options: &Options) -> error::Result<LzmaParams>
  - This was a previously internal API that is now exposed only when raw_decoder is enabled. I don't see any reason to restrict this to internal usage.

Additionally, annotations have been added to indicate availability of APIs on docs.rs.

Testing Strategy

[x] The existing test suite is sufficient and should all pass. The existing API tests have been refactored to use the raw_decoder APIs internally.

chyyran commented 2 years ago

The current API for decompress is very simple and matches up well with the existing one-shot functions. One downside of such a design is that the LzBuffers are still one-shot and result in an allocation every time decompress is called. However, being able to reuse a DecoderState is already a huge savings owing to the size of that struct.

It might be worth looking into allowing reuse of a backing buffer in an LzBuffer, but this requires further design work on the LzBuffer implementations that isn't really related to #72. The current API design here does not preclude a future API that would allow these buffers to be reused.

Note that LzmaDecoder::reset and Lzma2Decoder::reset do not reallocate, because the lclppb parameters stay the same throughout the lifetime of the LzmaDecoder.
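
To illustrate why fixed lc/lp parameters let a reset avoid reallocation, here is a minimal sketch. ProbTable, PROB_INIT and the table layout are hypothetical names chosen for illustration and are not the types lzma-rs uses internally; the only assumption taken from the LZMA format is that the literal probability table holds 0x300 << (lc + lp) entries, each starting at 0x400.

```rust
/// Hypothetical model of a probability table sized by `lc` and `lp`.
/// Illustrative only; this is not the layout lzma-rs uses internally.
struct ProbTable {
    probs: Vec<u16>,
}

/// Initial probability value for LZMA's range coder (half of 2^11).
const PROB_INIT: u16 = 0x400;

impl ProbTable {
    /// Allocates once; the length depends only on `lc` and `lp`.
    fn new(lc: u32, lp: u32) -> Self {
        ProbTable {
            probs: vec![PROB_INIT; 0x300usize << (lc + lp)],
        }
    }

    /// Restores the initial state in place. Length and capacity are
    /// untouched, so no allocation happens here.
    fn reset(&mut self) {
        self.probs.fill(PROB_INIT);
    }
}
```

Because the table length is a function of parameters that never change after construction, reset only has to overwrite existing entries.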

chyyran commented 2 years ago

cc @gendx

chyyran commented 2 years ago

I encountered a situation where simply resetting the decoder state was not enough: because the decoder is reused for each raw LZMA chunk, the unpacked size also needed to be set. Since this might not be the case for every usage, reset takes it as Option<Option<u64>>, where None leaves the unpacked size unchanged and Some(_) changes the unpacked size in the decoder state. Having to keep track of the previous unpacked size for streams with headers seemed annoying, and a plain Option<u64> would be ambiguous between "keep the unpacked size the same" and "change the unpacked size to None (for an end of payload marker)".

I thought about exposing set_unpacked_size on LzmaDecoder, but it didn't really make sense from my perspective, since reset prepares the decoder for the next chunk of data. For some streams, forgetting to call set_unpacked_size may leave the decoder in an invalid state for that specific stream, so keeping it atomic was best. I don't see any situation where someone would need to change the unpacked size between raw chunks.
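
The three cases that Option<Option<u64>> distinguishes can be sketched as follows. SizeState is a hypothetical stand-in for the size bookkeeping inside the decoder, not code from this pull request; only the meaning of None, Some(None) and Some(Some(n)) is taken from the description above.

```rust
/// Hypothetical stand-in for the decoder's size bookkeeping; not lzma-rs code.
#[derive(Debug, PartialEq)]
struct SizeState {
    /// `None` means "decode until the end-of-payload marker".
    unpacked_size: Option<u64>,
}

impl SizeState {
    /// Mirrors the three cases described for `reset`.
    fn reset(&mut self, unpacked_size: Option<Option<u64>>) {
        match unpacked_size {
            // Outer `None`: keep whatever size was set before.
            None => {}
            // `Some(None)`: size unknown, rely on the end-of-payload marker.
            Some(None) => self.unpacked_size = None,
            // `Some(Some(n))`: expect exactly `n` bytes from the next chunk.
            Some(Some(n)) => self.unpacked_size = Some(n),
        }
    }
}
```

With a single Option<u64>, the first two arms would collapse into one, which is the ambiguity the nested option avoids.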

jtmoon79 commented 1 year ago

Hi @chyyran @gendx

I want to read an XZ file into decompressed chunks of some constant size. Currently I'm using xz_decompress, but xz_decompress reads the entire file in one call. I want to call some "read" function multiple times to sequentially read the XZ file into fixed-size buffers (https://github.com/jtmoon79/super-speedy-syslog-searcher/issues/182).

Can I read an XZ file as a sequence of decompressed byte chunks (Vec<u8>) using these new API endpoints?

Thanks for creating this library! 😄

gendx / lzma-rs

Expose API for raw decoder primitives for LZMA and LZMA2. #74
