KillingSpark / zstd-rs

zstd-decoder in pure rust
MIT License
253 stars 34 forks source link

RFC: read frame size without decoding #42

Closed phord closed 1 year ago

phord commented 1 year ago

I need to measure the physical size of the frames without actually decoding them for my multi-frame approach. This way I can build an index of frames and sizes without the expense of decompressing everything.

I haven't tested this code. But I added a BlockDecodingStrategy to say "parse all blocks but don't decode any data".

What are your thoughts on the approach? Alternatives?

I'd rather be able to say FrameDecoder::get_frame_size() but I don' think there's a way to get that info without reading data from the Read. But this way I can skip past the frame and then check bytes_read_from_source().

KillingSpark commented 1 year ago

So the thing is, just decoding the blocks is already half the work necessary for decoding the file. The first three Bytes of each block are a simple header telling you all the info you need to skip that block and when the frame is complete.

I'd just initialize the frame decoder from your source and then only interpret these three bytes and then seek forward to the next block.

This will be way more efficient than decoding the blocks fully, which would mean actually reading each byte and decoding the fse tables etc etc

phord commented 1 year ago

I was trying to do it in some way that works with the API. "Only interpret those three bytes" is something the BlockDecoder is good at. I was going to extend BlockDecoder to expose some of that, but ~it's not even public~ hm... it is public, but I had trouble using it. I'll try again.

Then I wanted to add FrameDecoder::get_frame_size() but that requires reading all the box headers and it will not work with streaming modes.

What if I add FrameDecoder::skip_frame()? This could then also be used for skippable frames instead of relying on an Err Result to curry that information. It would work for skippable frames as well as block-content frames.

if frame.is_blocks() {
    frame.decode(...);
} else {
    frame.skip_frame(...);
}
KillingSpark commented 1 year ago

I think transparently skipping skippable frames is a bad idea. Those frames are likely important if they do exist. Imagine an unsuspecting user trying to decode single frame files. Just trying to decode the first frame will not work if there is a skippable frame at the front. Then the result will just be empty, without any notifications to that user.

I am opposed to making the block decoder any more complicated than it has to be. It is relatively hot code. Making this more complicated by adding more code-paths could lead to performance degradations.

Also: skipping is just inherently inefficient with an io::Reader. I'd much prefer to let the caller do that in the hopes that they have access to a io::Seek implementation or something similar.

"Only interpret those three bytes" is something the BlockDecoder is good at

I guess, but the hard part the BlockDecoder is also good at is decoding the whole block. Decoding these three bytes is very simple. Here is the relevant excerpt from the zstd doc:

After `Magic_Number` and `Frame_Header`, there are some number of blocks.
Each frame must have at least one block,
but there is no upper limit on the number of blocks per frame.

The structure of a block is as follows:

| `Block_Header` | `Block_Content` |
|:--------------:|:---------------:|
|    3 bytes     |     n bytes     |

__`Block_Header`__

`Block_Header` uses 3 bytes, written using __little-endian__ convention.
It contains 3 fields :

| `Last_Block` | `Block_Type` | `Block_Size` |
|:------------:|:------------:|:------------:|
|    bit 0     |  bits 1-2    |  bits 3-23   |
phord commented 1 year ago

I didn't consider that FrameDecoder takes a Read and not a Read + Seek. That does make it harder to implement skipping the skippable frames. Though, for streams, it would be the only sensible way.

I did manage to implement fn skip_frame(...) -> Result<(u64, u64), err> in terms of read_frame_header() and read_block_header().

let (uncompressed_bytes, frame_bytes) = self.skip_frame()?;

It does rely on Seek, and it is very fast. There's just one gap in the implementation, and that's around skippable frames. It's because

    match read_frame_header(&mut self.file) {
        Err(ReadFrameHeaderError::SkipFrame(_magic_num, skip_size,)) => {
            self.file.seek(SeekFrom::Current(skip_size as i64)).unwrap();
            // Skipped a frame with no uncompressible bytes
            // FIXME: Magic number "4" is the size of the frame header we parsed. read_frame_header should tell us that.
            Ok((Some(0), 4u64 + skip_size as u64))
        }

Since I have Seek, I could measure the skipped bytes myself. But I was hoping for something from the API.

    let fpos = self.file.stream_position().unwrap() as u64;
    match read_frame_header(&mut self.file) {
        Err(ReadFrameHeaderError::SkipFrame(_magic_num, skip_size,)) => {
            let new_pos = self.file.seek(SeekFrom::Current(skip_size as i64)).unwrap() as u64;
            // Skipped a frame with no uncompressible bytes
            Ok((Some(0), new_pos - fpos))
        }

Now, the other gimmick is that the only way to measure uncompressed_bytes reliably is to decompress the blocks, because all the uncompressed_size fields are optional. When they're present, though, indexing the file is quick.