atomicobject / heatshrink

data compression library for embedded/real-time systems
ISC License
1.3k stars 175 forks source link

Possible data loss at stream end in -w4 -l2 #16

Closed silentbicycle closed 9 years ago

silentbicycle commented 9 years ago

I got an excellent bug report from @unixdj with detailed, minimal steps to reproduce data loss at the end of the stream:

$ echo -n aaaa | ./heatshrink -e -w4 -l2 | ./heatshrink -d -w4 -l2
a   # should be "aaaa"

A regression test has been added.

This is not currently known to occur under any other combinations of settings except -w4 -l2, the absolute minimum.

A change that fixes this without breaking reverse compatibility would be strongly preferred.

silentbicycle commented 9 years ago

The encoder is correct, the problem is in the decoder. The tag bit (1) plus the window size (4) plus the lookahead size (2) is only 7 bits, and the decoder can end up in a state where it has 7 bits of unprocessed data, but the end of input stream takes precedence and it prematurely indicates that it has completed.

This will not happen under any other configuration because everything else will use at least 8 bits, causing things to be buffered correctly.

silentbicycle commented 9 years ago

I'm considering closing this by increasing the minimum lookahead size to 3, rather than adding special cases. Thoughts, @unixdj?

silentbicycle commented 9 years ago

Resolved by 15ebaddc064e60a36b8e4f46b90c0a5a69940123 . The minimum lookahead needs to be increased to 3, because otherwise a state can occur at the end of the input whose interpretation is ambiguous, leading to either a few bytes being discarded or the last byte being repeated incorrectly.

unixdj commented 9 years ago

Increasing minimum lookahead size fixes the issue, of course.

The problem with the encoder is that it encodes "a" as [0xb0 0x80], which, if the decoder is fixed to handle the case above, may be interpreted by the decoder as:

1        tag bit: literal follows
01100001 'a'
0        tag bit: backref
0000     backref index 1
00       backref count 1

and decoded as "aa". The encoder, however, would never produce such stream (though it would make sense to). So in order to handle this correctly, there are two solutions:

(a) Make the decoder disregard backreferences shorter than the encoder would produce. (b) Make the encoder fill the empty bits in the last byte with ones instead of zeroes, so that the decoder would finish in the HSDS_YIELD_LITERAL state.

I would say that option (b) seems most conceptuallly sound.

silentbicycle commented 9 years ago

Indeed. I discovered that "aa" issue while investigating the root cause of the original issue, as well.

(b) seems like a good choice, and IIRC you've already done some investigation along those lines. If that covers all the cases, then the minimum length won't need to be increased after all.

Option (a) reminded me that there may be an opportunity to improve the compression ratio slightly (breaking reverse-compatibility), added as issue #17.

unixdj commented 9 years ago

Yes, I've tried (b), and I reckon it covers all the corner cases. Here's the code:

https://github.com/atomicobject/heatshrink/compare/master...unixdj:eof42