airlift / aircompressor

A port of Snappy, LZO, LZ4, and Zstandard to Java
Apache License 2.0

Fix LZO compressor/decompressor #88

Closed · luohao closed this 6 years ago

luohao commented 6 years ago

There are a few issues in the current implementation of the LZO compressor and decompressor.

This PR fixes those issues.

@dain

dain commented 6 years ago

OK, I think I figured out the "stop command". Each compressed sequence must end with a 0b0001_HMMM command whose match offset is zero. This check is already present in the code, and the extra checks for the "stop condition" were overly restrictive and not needed. The decompressor may receive multiple compressed sequences (each ending with the zero-offset 0b0001_HMMM command), and it should continue decoding until the input is exhausted.
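To make the terminator concrete, here is an illustrative check (a sketch only, not the actual aircompressor code; it assumes the command byte and the decoded match offset are already available as ints):

```java
// Sketch based on the description above: a command byte of the form
// 0b0001_HMMM marks a long match, and such a command with a zero match
// offset terminates one compressed sequence. The outer decode loop then
// moves on to the next sequence until the input is exhausted.
static boolean isEndOfSequence(int command, int matchOffset)
{
    // top nibble 0b0001 selects the 0b0001_HMMM command family
    boolean isLongMatchCommand = (command & 0b1111_0000) == 0b0001_0000;
    // a zero match offset turns the command into the sequence terminator
    return isLongMatchCommand && matchOffset == 0;
}
```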

This commit has the changes, and adjusts the test to cause the compressor to produce multiple blocks.

https://github.com/dain/aircompressor/commit/8722e7d82b33a08836b08b52684fc91cc230970e

If this looks good to you, can you squash this into your commits?

luohao commented 6 years ago

@dain I think you are correct. Theoretically, the decompressor may get multiple sequences, but for some reason we don't see that in our datasets (compressed using hadoop-lzo).

Let me rerun the Presto verifier tests against our data with the change you added.

dain commented 6 years ago

To get multiple chunks, I had to set the Hadoop compression output buffer to be small enough that a single write doesn't fit in the output buffer (https://github.com/airlift/aircompressor/pull/88/files#diff-98f1a4be231a809a078478209f320e04R35). Maybe the writer code you are using writes in smaller pieces or has a large buffer. Also, Parquet may have its own internal framing (like ORC does) to avoid these issues.
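For reference, forcing multiple chunks from the Hadoop side could look roughly like this (a sketch assuming the hadoop-lzo LzoCodec and its io.compression.codec.lzo.buffersize key; the method name is made up):

```java
import com.hadoop.compression.lzo.LzoCodec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch: shrink the codec's output buffer so a single large write is
// flushed as several compressed sequences, each with its own terminator.
static byte[] compressInSmallChunks(byte[] data)
        throws IOException
{
    Configuration conf = new Configuration();
    // buffer smaller than the write below, so the codec must emit
    // (and terminate) more than one compressed sequence
    conf.setInt("io.compression.codec.lzo.buffersize", 1024);

    LzoCodec codec = new LzoCodec();
    codec.setConf(conf);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (CompressionOutputStream compressed = codec.createOutputStream(out)) {
        compressed.write(data); // single write larger than the buffer
    }
    return out.toByteArray();
}
```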

luohao commented 6 years ago

@dain Just finished the Presto verifier tests with your change. Found no issues, so the change you added should be fine.

I was trying to add the sequence samples I used for debugging to the test data. However, when I decompress them and add the original data to the test suite, the two cases where the copy length is a fixed 2 bytes or 3 bytes are still not covered by the unit tests. I guess HadoopLzoCompressor doesn't behave exactly the same as the compressor we use. I can fix this later in another PR if you are OK with that.

dain commented 6 years ago

@luohao I'd like to have good coverage for those cases as well. If you have test data compressed by your internal implementation that covers these cases, we can just check in the binary files.
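Something along these lines could exercise the checked-in samples (a sketch; the class name, resource layout, and expected-size bookkeeping are assumptions, while LzoDecompressor is aircompressor's raw LZO decompressor):

```java
import io.airlift.compress.lzo.LzoDecompressor;

import java.io.IOException;
import java.io.InputStream;

// Sketch only: decompress a checked-in binary sample and verify the size,
// using samples chosen to hit the fixed 2- and 3-byte copy-length commands.
public class CheckedInLzoSamples
{
    static byte[] decompressSample(String resourceName, int uncompressedSize)
            throws IOException
    {
        byte[] compressed;
        try (InputStream in = CheckedInLzoSamples.class.getResourceAsStream(resourceName)) {
            compressed = in.readAllBytes();
        }
        byte[] output = new byte[uncompressedSize];
        int written = new LzoDecompressor()
                .decompress(compressed, 0, compressed.length, output, 0, output.length);
        if (written != uncompressedSize) {
            throw new AssertionError("unexpected decompressed size: " + written);
        }
        return output;
    }
}
```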

In the meantime, I'll squash the LZO changes together and push them.

dain commented 6 years ago

Merged