gendx / lzma-rs

An LZMA decoder written in pure Rust
MIT License

support streaming read #10

Open vn971 opened 5 years ago

vn971 commented 5 years ago

Currently, the library provides a blocking function that reads from io::BufRead and writes to io::Write. This forces the user of the library to read all contents into memory, or into a file.

Sometimes, however, you only need to traverse the data, not hold all of it at once.

This could be achieved by having a function that, given an io::Read, returns something that implements io::Read as well. This way, you can progressively read the compressed or decompressed stream, while the library internally reads the underlying stream. This is how the xz2 crate works, for example; see the function signature of xz2::read::XzDecoder::new. It also looks flexible and intuitive: the decompressor starts to act like a "pipe" (in Unix terminology), rather than something that writes.

Support for this in lzma-rs would be very nice, I think. Personally, I'm raising the issue because I wanted to try this library in rua: https://github.com/vn971/rua There, I use an intermediate decompression layer for another function that accepts Read: https://github.com/vn971/rua/blob/master/src/tar_check.rs#L26 (however, the underlying library, xz2, is not pure Rust but uses bindings).

Thoughts?

gendx commented 5 years ago

This is a very interesting point!

I was wondering whether there is a generic way of transforming an io::Write into an io::Read. The opposite would be quite simple (read bytes from an io::Read and write them into an io::Write), but this direction looks trickier. Maybe it would be possible with async functions/generators? Or with a separate process, or simply a thread, that "writes" data to the main thread, which reads it (like with Unix pipes).
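To make the thread idea concrete, here is a minimal sketch using only the standard library. The names `PipeWriter`, `PipeReader` and `read_from_writer` are made up for illustration and are not part of lzma-rs; the producer closure stands in for any blocking "write everything into an io::Write" call.

```rust
use std::io::{self, Read, Write};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Write half: every `write` call ships one chunk to the reading side.
struct PipeWriter {
    tx: SyncSender<Vec<u8>>,
}

impl Write for PipeWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "reader dropped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Read half: hands out received chunks, keeping leftovers for the next call.
struct PipeReader {
    rx: Receiver<Vec<u8>>,
    chunk: Vec<u8>,
    pos: usize,
}

impl Read for PipeReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Block until there is an unread byte or the producer has finished.
        while self.pos == self.chunk.len() {
            match self.rx.recv() {
                Ok(next) => {
                    self.chunk = next;
                    self.pos = 0;
                }
                // The sender was dropped: the producer is done, report EOF.
                Err(_) => return Ok(0),
            }
        }
        let n = out.len().min(self.chunk.len() - self.pos);
        out[..n].copy_from_slice(&self.chunk[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

/// Runs a blocking `producer` on a worker thread and exposes its output as an
/// `io::Read`. Hypothetical helper, not an lzma-rs API.
fn read_from_writer<F>(producer: F) -> PipeReader
where
    F: FnOnce(&mut dyn Write) -> io::Result<()> + Send + 'static,
{
    // Bounded channel, so the producer cannot run far ahead of the reader.
    let (tx, rx) = sync_channel(4);
    thread::spawn(move || {
        let mut writer = PipeWriter { tx };
        // Producer errors are dropped in this sketch; the reader just sees EOF.
        let _ = producer(&mut writer);
    });
    PipeReader { rx, chunk: Vec::new(), pos: 0 }
}
```

The obvious costs are one extra thread and one allocation per chunk, and a real version would need a way to forward the producer's error to the reader instead of turning it into a silent EOF.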

In the meantime, I think the easiest way to support streaming would be to extract the loop body of the process function (https://github.com/gendx/lzma-rs/blob/master/src/decode/lzma.rs#L215) into a step function. Then, in the streaming case, use a temporary buffer as the io::Write for the current decoder; the read method of your io::Read would repeatedly call step and copy the bytes from the temporary buffer into the read buffer.
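A rough sketch of that shape, with a toy decoder standing in for the real one: `ToyDecoder`, `StreamingReader` and the (count, byte) pair format are all invented for illustration, and only the step-plus-temporary-buffer structure is the point.

```rust
use std::collections::VecDeque;
use std::io::{self, Read, Write};

/// Toy stand-in for the real decoder state: decodes (count, byte) pairs.
struct ToyDecoder<R> {
    input: R,
}

impl<R: Read> ToyDecoder<R> {
    /// One iteration of what would otherwise be the body of the `process` loop.
    /// Returns Ok(false) once the input is exhausted. Note that a clean end and
    /// a truncated trailing pair are both treated as end of input here.
    fn step<W: Write>(&mut self, output: &mut W) -> io::Result<bool> {
        let mut pair = [0u8; 2];
        match self.input.read_exact(&mut pair) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(false),
            Err(e) => return Err(e),
        }
        output.write_all(&vec![pair[1]; pair[0] as usize])?;
        Ok(true)
    }
}

/// `io::Read` adapter: runs `step` on demand and serves bytes from a
/// temporary buffer.
struct StreamingReader<R> {
    decoder: ToyDecoder<R>,
    pending: VecDeque<u8>,
    done: bool,
}

impl<R: Read> StreamingReader<R> {
    fn new(input: R) -> Self {
        StreamingReader {
            decoder: ToyDecoder { input },
            pending: VecDeque::new(),
            done: false,
        }
    }
}

impl<R: Read> Read for StreamingReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Decode only as much as is needed to have something to hand out.
        while self.pending.is_empty() && !self.done {
            self.done = !self.decoder.step(&mut self.pending)?;
        }
        // VecDeque<u8> implements Read by draining bytes from the front.
        self.pending.read(out)
    }
}
```

Because the adapter implements io::Read, it can be wrapped in std::io::BufReader and consumed incrementally, which is the use case from the original request.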

I probably won't have time to look at it more closely this week, but feel free to send a PR if you want to give it a try!

vn971 commented 5 years ago

Thanks for the explanation!

Regarding the process function and the temporary buffer -- indeed, this is how I thought it could be done as well.

I'm not sure I'll have time in the coming days either, though. Maybe I'll come back to this later, if/when I get rid of the other libraries that bind to OS libraries and am otherwise on pure Rust.

demurgos commented 5 years ago

Hi, I am maintaining swf-parser, a library to parse SWF files. These files can be encoded with LZMA and I am using this library to decode them. To support streaming parsing of SWF files, streaming support in LZMA is required first. A low-level API similar to the one used by the inflate crate would be nice. Using this API, you create a stream inflater that maintains the internal state of the parser (for LZMA it would correspond to dictionaries and temporary buffers). You can then manually feed data to the decoder and read the result.

cccs-sadugas commented 4 years ago

I've been working on an implementation for this ticket based off of the LzmaDec_TryDummy function in libhtp's port of the LZMA SDK. The main issue in incrementally executing the loop is that you may end up in a partially corrupted state if you are in the middle of a function and you fail to read the next byte because it isn't available yet.
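The general shape of the workaround can be sketched as a dry run before each step: check, without mutating anything, that the buffered input is enough to finish the step, and only then commit. The toy format (a length byte followed by that many payload bytes) and the `Incremental` type below are assumptions made for illustration; they are not the LZMA SDK's logic.

```rust
/// Toy incremental decoder. Each record is a length byte followed by that
/// many payload bytes.
#[derive(Default)]
struct Incremental {
    /// Input received so far but not yet decoded.
    buffered: Vec<u8>,
    /// Decoded payload bytes.
    output: Vec<u8>,
}

impl Incremental {
    /// Dry run: would a full step succeed with the input buffered so far?
    /// Takes `&self`, so it cannot leave the decoder half-updated.
    fn can_step(&self) -> bool {
        match self.buffered.first() {
            Some(&len) => self.buffered.len() >= 1 + len as usize,
            None => false,
        }
    }

    /// Commits one record. Only call this after `can_step` returned true.
    fn step(&mut self) {
        let len = self.buffered[0] as usize;
        self.output.extend_from_slice(&self.buffered[1..1 + len]);
        self.buffered.drain(..1 + len);
    }

    /// Accepts any amount of input and decodes every record that is complete.
    fn feed(&mut self, input: &[u8]) {
        self.buffered.extend_from_slice(input);
        while self.can_step() {
            self.step();
        }
    }
}
```

The trade-off is that incomplete input is held in a side buffer until the rest arrives, in exchange for never leaving the decoder state half-updated.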

Also, I used the std::io::Write trait instead of std::io::Read to create an interface like flate2::write::DeflateDecoder.
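For readers unfamiliar with that style, here is a minimal sketch of a decoder that is itself an io::Write: encoded bytes are pushed in through write, and decoded bytes come out of the wrapped writer. `WriteDecoder` and its (count, byte) pair format are toy assumptions and are unrelated to the actual PR.

```rust
use std::io::{self, Write};

/// Toy push-style decoder for (count, byte) pairs.
struct WriteDecoder<W: Write> {
    inner: W,
    /// A count whose byte has not arrived yet; the only state kept between
    /// `write` calls.
    pending_count: Option<u8>,
}

impl<W: Write> WriteDecoder<W> {
    fn new(inner: W) -> Self {
        WriteDecoder { inner, pending_count: None }
    }

    /// Flushes and hands back the wrapped writer. Fails if the input stopped
    /// in the middle of a pair.
    fn finish(mut self) -> io::Result<W> {
        if self.pending_count.is_some() {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated pair"));
        }
        self.inner.flush()?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for WriteDecoder<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        for &b in buf {
            match self.pending_count.take() {
                // First byte of a pair: remember the count.
                None => self.pending_count = Some(b),
                // Second byte: emit it `count` times.
                Some(count) => self.inner.write_all(&vec![b; count as usize])?,
            }
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

With this shape the caller drives the decoder, for example with std::io::copy from any reader, and input can arrive in arbitrarily small pieces.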

I'll publish this soon. It will most likely be dependent on #50.

gendx commented 4 years ago

I'm now wondering whether integrating with async/await would be the way to go to implement this. Something like taking futures::io::AsyncRead as input and writing to a futures::io::AsyncWrite or a futures::stream::Stream of bytes as output.

I don't know what the performance overhead of that would be, but from a programming perspective the code should be similar to the current one (with some extra async keywords). The streaming mode would be gated by a feature flag.

cccs-sadugas commented 4 years ago

@gendx I published a PR for this if you want to have a look. I haven't really thought of implementing it using futures but that's an interesting idea. It would add a couple extra dependencies for those who want to use a streaming API and possibly require a runtime. I was looking for a solution that uses an std::io::Write interface to have an API consistent with flate2::write::DeflateDecoder to implement a generic decoder.

Herschel commented 3 years ago

It'd be useful if a Read interface were also provided (compare flate2 which has both read::DeflateDecoder and write::DeflateDecoder).

soulmachine commented 3 years ago

Reading line by line is very important. For example, flate2 can read .gz files line by line:

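use std::io::BufRead; // brings `lines()` into scope

let f_in = std::fs::File::open("sample.txt.gz").unwrap();
let d = flate2::read::GzDecoder::new(f_in);
let buf_reader = std::io::BufReader::new(d);
for line in buf_reader.lines() {
    // Each item is an io::Result<String>, so handle or unwrap the error.
    println!("{}", line.unwrap());
}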