facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

issues found while fuzzing to compare behaviour of Zig decoder to this repo #3508


dweiller commented 1 year ago

We've been doing some fuzz tests of a Zig implementation of Zstandard decoding we're working on, comparing its behaviour with this repo (often referred to as 'the reference' below). We've encountered a few cases where the behaviour differs, and I think we need some help deciding whether we have a bug, whether we've found a bug in the reference, or whether the two implementations simply differ in how permissive they are in some edge cases.

There are 4 cases so far - if you'd prefer I split them into different issues, let me know and I can do that, editing this one down to just one of them. Any help diagnosing the issues below would be appreciated - another pair of eyes to verify my findings, someone telling me whether I'm misreading the docs, or a pointer to any tools in this repo that can help pull apart some of the more complex inputs below.

Case 1

Input:

00000000: 28b5 2ffd c40c 3500 0000 0000 0000 4500  (./...5.......E.
00000010: 00e9 9501 5404 0215 011f 0ccd 46         ....T.......F

This is a frame with a single compressed block where the literals section is RLE of byte 0x95 and the sequence section contains RLE literal, match and offset codes and a single sequence. The issue is that the offset code is 0x02, which (according to the docs) means 2 bits should be read from the FSE bit-stream to get the offset value. However, the FSE bit-stream is the single byte 0x01, which has 0 bits available (i.e. it's just the first byte and only contains the padding bits and the start-of-stream marker). I think this input is malformed, but it could be made well-formed by changing the offset code byte to 0x00, so that no bits are read from the bit-stream while still resulting in an offset of 1.
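To illustrate, here's a minimal C sketch (toy code of mine, not either implementation's bit-reader) of how I'm counting the usable bits in the final byte of a reverse bit-stream; with 0x01 the start-of-stream marker is bit 0, so 0 bits are readable:

```c
#include <stdio.h>

/* Toy helper, not zstd source: the highest set bit of the last byte of a
 * reverse bit-stream is the start-of-stream marker; bits above it are
 * zero padding, and only the bits below it can be read. */
static int usable_bits_in_last_byte(unsigned char last)
{
    if (last == 0) return -1;                 /* no marker: always corrupt */
    int marker = 7;
    while (((last >> marker) & 1) == 0)
        marker--;                             /* locate the marker bit */
    return marker;                            /* readable bits below it */
}

int main(void)
{
    /* Case 1's bit-stream is the single byte 0x01: 0 readable bits,
     * yet the offset code 0x02 asks the decoder to read 2. */
    printf("%d usable bits\n", usable_bits_in_last_byte(0x01));
    return 0;
}
```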

The reference decompresses this input into 53 copies of the byte 0x95, which is what the Zig implementation produces if the offset code byte is modified as described. My interpretation is that the reference isn't non-conformant here (unless this is a bug that causes problems on other inputs), just more permissive, and that the Zig implementation is still compliant in rejecting this input.

Case 2

Input: the file compare-alloc/id:000032,sig:06,src:000123,time:1670472,execs:7740992,op:havoc,rep:2 in this zip file.

It's a bit too long to look at by hand, I think, but the issue seems to be in the Huffman tree decoding (that is, at least, what the reference reports).

A collaborator of mine stepped through the error we were getting in the reference with a debugger and found it was due to it returning corruption_detected here, because rankStats[1] is 0 at that point. I am not familiar with the reference implementation code so I might be misinterpreting what I'm reading there, but it looks like the reference code is saying that a Huffman tree needs to have an even number (and at least 2) of symbols of weight 1 - I don't see anything in the docs that says this.
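For illustration, here's a sketch of the condition as I read it - this is not the actual zstd source, and rank_counts is my name for the per-weight symbol histogram (rankStats in the reference):

```c
#include <stdio.h>

/* Hypothetical sketch (not zstd source) of the invariant being discussed:
 * rank_counts[w] counts the symbols that were assigned weight w. */
static int reference_accepts(const int rank_counts[13])
{
    /* at least 2 symbols of weight 1, and an even number of them */
    return rank_counts[1] >= 2 && (rank_counts[1] % 2) == 0;
}

int main(void)
{
    int counts[13] = {0};
    counts[2] = 4;   /* a table with no weight-1 symbols at all */
    printf("accepted: %d\n", reference_accepts(counts));   /* prints 0 */
    return 0;
}
```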

I'm not really sure about this one - I think either I'm missing something in the docs or it's a bug in the reference.

While investigating this issue, I also noticed that there is a check that the total weight is not zero (before considering the implied extra weight), which the docs don't appear to flag as an error either. This hasn't caused any differences between the Zig and reference implementations that I've seen.

Edit: I don't think I was clear above, but this case is successfully decoded by the Zig implementation.

Case 3

I've got 2 inputs:

28b5 2ffd 9019 ff7f fdc4 0500 00

and

28b5 2ffd 0000 0000 0005 0000

The first of these contains a single compressed block and the second contains an empty raw block followed by a compressed block. In both, the compressed block has block header 0x05 0x00 0x00 (the last three bytes of each input). The reference decodes these into zero bytes, but my reading of the docs is that they are malformed frames and that a compressed block should always have a block content of at least 2 bytes (the literals and sequences section headers are required and each use at least 1 byte).
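To make that reading concrete, a small C sketch (field layout per RFC 8878; code mine, not from either implementation) decoding the 3-byte block header confirms it describes a last block of type Compressed_Block with Block_Size 0:

```c
#include <stdio.h>

int main(void)
{
    /* Block_Header is 3 bytes, little-endian:
     *   bit 0      Last_Block
     *   bits 1-2   Block_Type (0=Raw, 1=RLE, 2=Compressed, 3=Reserved)
     *   bits 3-23  Block_Size */
    const unsigned char hdr[3] = { 0x05, 0x00, 0x00 };
    unsigned v = hdr[0] | (hdr[1] << 8) | ((unsigned)hdr[2] << 16);

    printf("Last_Block = %u\n", v & 1);          /* 1 */
    printf("Block_Type = %u\n", (v >> 1) & 3);   /* 2 = Compressed_Block */
    printf("Block_Size = %u\n", v >> 3);         /* 0 */
    return 0;
}
```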

I might be misinterpreting the first sentence or the section on compressed blocks, which might be supposed to mean that a compressed block can have a Block_Size of zero (or one, though that would be confusing - what would the single byte mean?). If that's the case, I think the docs could be clarified. If I'm interpreting the docs correctly, then I think the reference is also compliant and just more permissive here.

Case 4

Input:

00000000: 28b5 2ffd e4bc 0000 0000 0000 003d 0200  (./..........=..
00000010: 464b 0b0b e019 0e02 5871 30d1 0566 0907  FK......Xq0..f..
00000020: 0007 0007 00df ee7b ffff fb0f cfbf f7fd  .......{........
00000030: feff 1bff f7bb ffef ff03 ffff fbbf ffff  ................
00000040: 02a8 1430 e0ff 9fff 13b0 81ff 1430 80fd  ...0.........0..
00000050: 3ad1 0200 0601 023e 1c33 5a              :......>.3Z

This input is compare-stream/MalformedBlock/id:000019,sig:06,src:000142,time:24158073,execs:33704590,op:havoc,rep:2 in this zip file, and the output from our fuzzer, which includes logs from the reference implementation can be found in the file compare-stream/MalformedBlock/output.txt if anyone wants to look.

I have a less complete understanding of the issue in this frame, as I think it lies in the decoding of the FSE tables in the sequences section. The following might be a bit verbose, but as I'm less sure of exactly what the issue is, I thought I'd better not elide anything I found while trying to diagnose it.

This is a single compressed block of (compressed) size 71. The literals section is of type Compressed_Literals_Block with regenerated size 180 and compressed size 45. The Zig and reference implementations agree up to this point (the reference logs say ZSTD_decodeLiteralsBlock : cSize=48, nbLiterals=180, and I'm interpreting cSize to include the 3 bytes of the literals section header, i.e. the bytes 0x46 4b 0b at offset 0x10).
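For anyone checking along, here's a small C sketch (field layout per RFC 8878; code mine) that decodes the 3-byte literals section header 0x46 0x4b 0x0b and reproduces those numbers:

```c
#include <stdio.h>

int main(void)
{
    /* Literals_Section_Header for a Compressed_Literals_Block with
     * Size_Format 0b01 is 3 bytes, little-endian:
     *   bits 0-1   Literals_Block_Type (2 = Compressed_Literals_Block)
     *   bits 2-3   Size_Format
     *   bits 4-13  Regenerated_Size (10 bits)
     *   bits 14-23 Compressed_Size  (10 bits) */
    const unsigned char hdr[3] = { 0x46, 0x4b, 0x0b };
    unsigned v = hdr[0] | (hdr[1] << 8) | ((unsigned)hdr[2] << 16);

    printf("Literals_Block_Type = %u\n", v & 3);             /* 2   */
    printf("Size_Format         = %u\n", (v >> 2) & 3);      /* 1   */
    printf("Regenerated_Size    = %u\n", (v >> 4) & 0x3ff);  /* 180 */
    printf("Compressed_Size     = %u\n", (v >> 14) & 0x3ff); /* 45  */
    return 0;
}
```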

The sequences section starts at offset 0x40, so there are 2 sequences, and both implementations seem to be in sync up to here. The compression modes byte is a8, so all three table modes are FSE_Compressed_Mode. The Zig and reference implementations get the same first sequence (in terms of the literal_length, match_length and offset), but an error is reported by the Zig implementation indicating that there are not enough bits in the bit-stream to update the FSE states for the second sequence. The reference logs I've seen don't give enough information about how many bits individual operations consume to tell exactly how the bits are being consumed for the FSE tables or while decoding sequences (maybe there are some other flags we can turn on?).
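The same kind of sketch (again code mine, layout per RFC 8878) for the start of the sequences section at offset 0x40 - the byte 0x02 is Number_of_Sequences, then 0xa8 is the compression-modes byte:

```c
#include <stdio.h>

int main(void)
{
    /* Sequences_Section_Header here: 0x02 = Number_of_Sequences (a single
     * byte, since it is < 128), then the compression-modes byte:
     *   bits 6-7  Literals_Lengths_Mode
     *   bits 4-5  Offsets_Mode
     *   bits 2-3  Match_Lengths_Mode
     *   bits 0-1  Reserved (must be 0)
     * Mode 2 = FSE_Compressed_Mode. */
    const unsigned char modes = 0xa8;
    printf("Literals_Lengths_Mode = %u\n", (modes >> 6) & 3); /* 2 */
    printf("Offsets_Mode          = %u\n", (modes >> 4) & 3); /* 2 */
    printf("Match_Lengths_Mode    = %u\n", (modes >> 2) & 3); /* 2 */
    printf("Reserved              = %u\n", modes & 3);        /* 0 */
    return 0;
}
```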

This leads me to believe the issue (either in Zig or reference) lies somewhere in decoding the sequence FSE tables, with the Zig implementation consuming more bytes than the reference, causing the reverse FSE bit-stream to be shorter and run out of bits while the reference does not. The Zig implementation thinks that the reverse FSE bit-stream is the bytes 0xd1 02 00 06 01 02 (starting at offset 0x51) and it runs out while trying to read bits to update the offset state (all states apparently want to read 1 bit in order to update). I'm not sure how much longer the reference thinks the reverse FSE bit-stream is, but it says the second sequence is seq: litL=0, matchL=5, offset=125.

I haven't tried to decode the FSE tables by hand (which would probably be my next step), in the hope that there's either a tool you trust that could help me do this, or that someone has an idea of what the problem might be or a better way to proceed.

dweiller commented 1 year ago

I just realised that I should have double-checked which commit the fuzzing was done against - I believe it was 79bdb8cb, but @squeek502 could confirm.

klauspost commented 1 year ago

Comparing against my Go implementation

Case 1: Returns unexpected EOF.

Case 2: Returns fse decompress returned: unexpected EOF - decompressing the FSE-compressed weights of the Huffman table over-reads 4 bits when testing here.

Case 3.1: My decoder returns block too small.

Case 3.2: zstd -d sample.zst (v1.5.0) returns zstd: sample.zst: unsupported format. My decoder returns invalid input: magic number mismatch.

Case 4: unexpected EOF. Sequences over-read by 8 bits on second sequence element.

It seems that we pretty much agree on the interpretation of the spec and have similar checks.

dweiller commented 1 year ago

> It seems that we pretty much agree on the interpretation of the spec and have similar checks.

Everything except case 2. Looks like I forgot to mention it explicitly, but my Zig implementation decodes that one without an error.

klauspost commented 1 year ago

@dweiller That is the "pretty much" qualifier :D

When the FSE decode is done, I check whether more bits were read than the input provided.

In this case bitsRead: 68, where 64 would be all bits available on the 12 byte input.

dweiller commented 1 year ago

> In this case bitsRead: 68, where 64 would be all bits available on the 12 byte input.

I'm not sure what the 12 bytes you're referring to are (the total input for case 2 is 1106 bytes long). As far as I can tell you can't know how long the FSE table is before you decode it. The Zig implementation is reading 65 bits to decode the FSE table.

klauspost commented 1 year ago

> I'm not sure what the 12 bytes you're referring to are (the total input for case 2 is 1106 bytes long). As far as I can tell you can't know how long the FSE table is before you decode it. The Zig implementation is reading 65 bits to decode the FSE table.

The headerByte is 12 and "The length of the FSE-compressed series is equal to headerByte (0-127)".

dweiller commented 1 year ago

> The headerByte is 12 and "The length of the FSE-compressed series is equal to headerByte (0-127)".

Ah, yes - that's correct. But why do you say there would only be 64 bits available? That's 4 bytes less than indicated by the header byte (or 3 if the header byte is meant to be included - I don't recall right now).

klauspost commented 1 year ago

No, that is merely an internal counter. In total it reads 100/96 bits (including discarded bits on the last byte).

I only check for over-reads. Under-reads should technically also be corruption, since the last bit is used for alignment.

terrelln commented 1 year ago

Thanks for the extremely detailed reports @dweiller!

I'll go through each case and respond to them individually, though I may not find the time until later this week.

Cyan4973 commented 1 year ago

Regarding case 2:

> A collaborator of mine stepped through the error we were getting in the reference with a debugger and found it was due to it returning corruption_detected here, because rankStats[1] is 0 at that point. I am not familiar with the reference implementation code so I might be misinterpreting what I'm reading there, but it looks like the reference code is saying that a Huffman tree needs to have an even number (and at least 2) of symbols of weight 1 - I don't see anything in the docs that says this.

This is a very good point.

The symbols of the Huffman table have frequencies described as "weights", which are powers of 2. Weight 1 is the smallest possible one and represents a frequency of 2^(1-1) = 1; Weight 2 represents a frequency of 2, Weight 3 a frequency of 4, Weight 4 a frequency of 8, etc. The total of all frequencies of all symbols must be a clean power of 2, otherwise it's a corrupted table.

Now, in the wording used above, note that "Weight 1" is described as "the smallest possible weight", and that's also the wording employed in the specification. But in the way it's used in the Zstandard reference implementation, it's actually the weight of the symbol of smallest frequency. The difference may seem small, but it follows that there is necessarily at least one smallest non-zero-frequency symbol, and that symbol necessarily receives Weight 1. It follows that the number of symbols with Weight == 1 is necessarily >= 1. Furthermore, in order for the sum of frequencies to reach a clean power of 2, the number of Weight-1 symbols must necessarily be even (each contributes an odd frequency of 1, while every other weight contributes an even frequency). Therefore, it must be >= 2.

Now, that's the way the Zstandard reference implementation works, but indeed, in the RFC8878 specification, nowhere is it stated that Weight 1 necessarily exists because it represents the weight of the smallest non-zero-frequency symbol. Therefore, it's also possible to create a "valid" Huffman table where the sum of all frequencies is a clean power of 2, but no symbol ever receives a weight of 1. Sure, this is slightly more wasteful, but the inefficiency is no big deal, so it's possible.
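To make this concrete, consider a purely illustrative table of four symbols with Weight 2 and all other symbols at Weight 0: the frequency sum is 2+2+2+2 = 8, a clean power of 2, yet rankStats[1] would be 0. A quick check in C:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical weight table: four symbols of weight 2, none of
     * weight 1. Frequency of a weight-w symbol is 2^(w-1), so the
     * total is 8, a clean power of 2, yet rankStats[1] would be 0. */
    const int weights[4] = { 2, 2, 2, 2 };
    unsigned total = 0;
    for (int i = 0; i < 4; i++)
        total += 1u << (weights[i] - 1);
    printf("total = %u (power of 2: %s)\n",
           total, (total & (total - 1)) == 0 ? "yes" : "no");
    return 0;
}
```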

This leaves us with 2 options:

1. Fix the specification, via an RFC erratum, to state that a valid Huffman table description must contain at least one Weight-1 symbol (and therefore an even number >= 2 of them), matching the behaviour of the reference implementation.
2. Fix the reference implementation, relaxing it to accept table descriptions with no Weight-1 symbol, which every reference decoder deployed so far rejects.

While none of the 2 options above are ideal, it looks to me that option 2 is potentially more disruptive to the ecosystem.

That being said, it's certainly worth a discussion.

dweiller commented 1 year ago

Okay - this makes sense to me, thanks :D.

Checking over the docs on Huffman coding again, I saw/remembered an issue with the example for determining the final implicit weight - the example says Weight[5] = 16-15 = 1, but I believe it should rather say Weight[5] = log_2(16 - 15) + 1 = 1.
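In C, the corrected rule reads something like this (toy helper of mine, assuming a well-formed table where the remainder is a power of 2):

```c
#include <stdio.h>

/* Toy version of the implied-last-weight rule: sum the frequencies
 * 2^(w-1) of the explicit weights, round up to the next power of 2
 * (strictly greater), and the remainder gives the final weight as
 * log2(remainder) + 1. */
static int last_weight(unsigned explicit_total)
{
    unsigned next_pow2 = 1;
    while (next_pow2 <= explicit_total)
        next_pow2 <<= 1;
    unsigned rest = next_pow2 - explicit_total;
    int lg = 0;                 /* log2(rest); rest is a power of 2 */
    while ((1u << lg) < rest)
        lg++;
    return lg + 1;
}

int main(void)
{
    /* The docs' example: explicit frequencies total 15, so the next
     * power of 2 is 16, and log_2(16 - 15) + 1 = 1. */
    printf("Weight[5] = %d\n", last_weight(15));
    return 0;
}
```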

> Now, in the wording used above, note that "Weight 1" is described as "the smallest possible weight", and that's also the wording employed in the specification. But in the way it's used in the Zstandard reference implementation, it's actually the weight of the symbol of smallest frequency. The difference may seem small, but it follows that there is necessarily at least one smallest non-zero-frequency symbol, and that symbol necessarily receives Weight 1. It follows that the number of symbols with Weight == 1 is necessarily >= 1. Furthermore, in order for the sum of frequencies to reach a clean power of 2, the number of Weight-1 symbols must necessarily be even (each contributes an odd frequency of 1, while every other weight contributes an even frequency). Therefore, it must be >= 2.

Thank you for the explanation :D. I think it should be possible to have only a single weight-1 symbol in the (admittedly silly) situation where the Huffman tree has only a single non-zero weight and every other symbol has weight zero, unless I'm missing a part of the docs that precludes this (as you should use RLE literals for this). FSE-compressed Huffman trees would need to have at least two non-zero weights, as the section on FSE table descriptions says 'there must be two or more symbols with nonzero probability', but I don't see an analogous condition on directly encoded Huffman trees.

As for the two options mentioned, I probably don't understand all the considerations you may have in terms of the ecosystem, but I think the first option, fixing the specification, makes sense. I wouldn't imagine there are any encoders that actually encode Huffman trees that don't use a weight-1 symbol (which I think is the only thing that would cause anyone an issue if the specification is fixed?).

Cyan4973 commented 1 year ago

> I think it should be possible to have only a single weight-1 symbol in the (admittedly silly) situation where the Huffman tree has only a single non-zero weight and every other symbol has weight zero, unless I'm missing a part of the docs that precludes this (as you should use RLE literals for this).

If there is only 1 symbol, then the RLE mode must be used. It's not just that it is more efficient; it's also that the Huffman implementation doesn't work if there is a single symbol: since such a symbol would then cost 0 bits, the bitstream wouldn't know when to stop!

Maybe the wording of the specification is not clear enough for this corner case, and therefore could be updated for improved clarity. That could be grounds for another RFC erratum (errata are not "errors" strictly speaking; oftentimes they are just clarifying comments about corner cases, like this one).

> the example says Weight[5] = 16-15 = 1, but I believe it should rather say Weight[5] = log_2(16 - 15) + 1 = 1.

Yes, you are right.

Cyan4973 commented 1 year ago

Regarding case 3:

> In both, the compressed block has block header 0x05 0x00 0x00 (the last three bytes of each input). The reference decodes these into zero bytes, but my reading of the docs is that they are malformed frames and that a compressed block should always have a block content of at least 2 bytes (the literals and sequences section headers are required and each use at least 1 byte).

I believe you are right. A "compressed" block should have a size of at least 2, even if it does represent an empty block, if only to indicate that both Literals and Sequences sections are empty.

It's unclear to me why the reference decoder implementation does not error out for this case. This is likely worth an investigation.

klauspost commented 1 year ago

IMO the FSE bit-stream should also match the exact size. Since it is decoded from the back, there is no valid encoding that should over-read. And since it was never part of the spec, retroactively specifying that all over-read bits must be '0' wouldn't be feasible IMO.

Under-reading is a bit more tricky, since it doesn't give undefined behavior. But the spec should say that extra bits should either A) cause an error or B) be ignored, just so we don't end up with diverging decoders.

Cyan4973 commented 1 year ago

Regarding Case 1

My interpretation of this scenario is that the decoder is instructed to read more bits than are present in the bitstream. Therefore, it would seem logical that the decoder would error out.

Well, the bitstream of the reference decoder is designed in a way that, if instructed to produce more bits than it contains, it produces zeroes. And it also returns an error flag. Therefore, in theory, the decoder should check the flag after every read, and branch out if it's an error.

It turns out that this is a very hot part of the code, probably the most performance critical. Adding branches and exits has a cost. Reading more zeroes does not risk any buffer overflow, because they are "made up", not actually read. But indeed, if we reach this point, data is certainly corrupt, so it would be useful to report it.

However, an additional issue is that the bitstream is almost always over-read. This is part of a technique designed to reduce the amount of overhead from FSE bitstreams. Essentially, the last symbol is read from the last state value, and updating the state value afterwards is going to over-read the bitstream. That's not a problem in itself, because the state value is never read again - but the bitstream will signal the over-read with an error flag. At this point, it's no longer possible to determine whether the over-read is a consequence of updating the state after the last symbol, or is due to another reason, such as reading additional bits: both scenarios produce the same flag. Before that point, adding intermediate checks would reduce speed, aka consume energy, across the entire installed base of reference decoders - all this to more gracefully handle a rare invalid corner case. So the trade-off is not obvious.
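For illustration, here is a toy sketch of that bitstream design (my code, not the actual zstd bitstream): over-reads synthesize zero bits and latch a flag, and since the final state update legitimately trips the same flag, the flag alone cannot distinguish the benign over-read from a corrupt one:

```c
#include <stdio.h>

/* Toy bit-reader in the style described above, not the zstd source:
 * over-reads return zero bits and latch a flag instead of branching. */
typedef struct {
    unsigned long long bits;  /* remaining stream bits (toy container) */
    int avail;                /* bits still available */
    int overflowed;           /* latched on any over-read */
} BitReader;

static unsigned read_bits(BitReader *br, int n)
{
    unsigned v = (unsigned)(br->bits & ((1ull << n) - 1));
    br->bits >>= n;
    if (n > br->avail) {
        br->overflowed = 1;   /* made-up zero bits were handed back */
        br->avail = 0;
    } else {
        br->avail -= n;
    }
    return v;
}

int main(void)
{
    BitReader br = { 0x2d, 6, 0 };   /* 6 real bits in the stream */
    (void)read_bits(&br, 4);         /* fine */
    (void)read_bits(&br, 4);         /* over-read: zeros, flag latched */
    /* At this point a benign final-state update and genuinely corrupt
     * input look identical: both just set the same flag. */
    printf("overflowed = %d\n", br.overflowed);
    return 0;
}
```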

If the data is corrupted and a checksum is present, it will be detected (with high probability). But if there is no checksum, corrupted data is produced and there is no signal. This is not optimal, but note that there is no way to guarantee that every corruption will be detected without using the checksum, so that was never part of the contract. A simple case is when literals are transmitted uncompressed: if the uncompressed literal bytes are corrupted, there's no way to detect it (except with the final checksum). So, rather, the emphasis is on returning a corruption error whenever it's relatively easy to detect and doesn't cost performance.

For case 1, it seems that the "easy to detect" bar is not so far off. It's more the performance cost that is at stake.

It might be possible to make some subtle changes to the decoding loop so that it can detect this issue and return an error, all without impacting speed. If that's possible, then yes, this would be a positive update. But it might also be more difficult than it sounds to achieve that objective without impacting performance negatively. So this effort remains to be done.

In any case, your interpretation of the spec looks correct, and your Zig implementation of the decoder is simply stricter than the reference decoder.

klauspost commented 1 year ago

I have full respect for that.

My only objection is that this is considered the reference decoder. This means that, while we wouldn't expect these things to be emitted by (lib)zstd, we could get into a situation where an encoder relies on these quirks.

Since the reference decoder doesn't flag these, we could realistically see an extra value being stuffed onto the FSE stream.

Or an encoder could come to rely on the decoder returning zeros when over-reading, so when starting a stream it would simply throw away any 0 bits, since the decoder pads them. That might even save a byte here or there.

For the Go implementation I check the value of "bits read" for the FSE streams whenever a block is finished. Right now I don't report under-reads, since that is an open issue, but it will error on over-reads.

My main concern is not a big one. It is just that, as a port, we could get into situations where we see failures on content from other third-party ports. So our only options seem to be to add these quirks to our ports, or to live with the risk of our implementation suddenly breaking.

Simply referencing that we are "following the spec" isn't really a persuasive argument. If the reference decoder would reject it, it would be a clear case.

Cyan4973 commented 1 year ago

I understand the issue for the ecosystem.

Note that the reference decoder is also deployed, and in prod it carries the responsibility to provide best-in-class performance for its user applications.

So I think that what this request points at is a need for a validation decoder. Something that would be used primarily to validate conformance of 3rd party implementations, and would be very strict regarding what is allowed or not. Such an implementation would be free of shenanigans such as speed or memory optimizations, and could focus on its one job.

Edit: created https://github.com/facebook/zstd/issues/3546 to track this topic.