Frommi / miniz_oxide

Rust replacement for miniz
MIT License

Improve block type selection algorithm #100

Open claudiosdc opened 3 years ago

claudiosdc commented 3 years ago

I am currently developing a library that uses data compression, and to handle that task I have chosen to use the flate2 crate with its default compression backend, miniz_oxide.

While writing unit tests for my library, I noticed that a particular piece of data was not producing the expected compression result. After further investigation, I realized that the compressed output consisted of a single non-compressed (stored) data block, which makes it larger than the input. The same data, when compressed using zlib, produces a different result: a single compressed data block that is smaller than the input.

This can be verified using the code snippet below.

    use std::io::Read;

    use flate2::Compression;

    #[test]
    fn it_compress_issue() {
        let data = r#"{"status":"success","data":{"messageId":"mg9x9vCqYMg9YtKdDwQx"}}"#.as_bytes();

        // Compression using 'miniz_oxide' crate directly
        let compressed_data = miniz_oxide::deflate::compress_to_vec(data, 9);

        // The output is larger than the input: a 5-byte stored-block header
        // followed by the input bytes, unchanged.
        assert!(compressed_data.len() > data.len());
        assert_eq!(&compressed_data[5..], data);

        // Compression using 'flate2' crate with 'zlib' feature enabled
        let mut enc = flate2::read::DeflateEncoder::new(data, Compression::default());
        let mut compressed_data_2 = Vec::new();

        // `read_to_end` comes from the `std::io::Read` trait imported above.
        enc.read_to_end(&mut compressed_data_2).unwrap();

        // zlib emits a compressed block that is smaller than the input.
        assert!(compressed_data_2.len() < data.len());
    }

This might be related to issue #77, I guess.
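
For readers who want to check which block type a raw DEFLATE stream starts with, here is a minimal sketch based on the block header layout in RFC 1951. It is not part of miniz_oxide or flate2; the names `BlockType`, `first_block_header`, and `stored_payload` are hypothetical helpers made up for illustration. It also shows where the 5-byte offset in the test above comes from: one header byte plus the two-byte LEN and NLEN fields.

```rust
/// Block types defined by the DEFLATE format (RFC 1951, section 3.2.3).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockType {
    Stored,
    FixedHuffman,
    DynamicHuffman,
    Reserved,
}

/// Hypothetical helper: decodes the header of the first block in a raw
/// DEFLATE stream. Returns `(is_final_block, block_type)`, or `None` if
/// the stream is empty.
pub fn first_block_header(stream: &[u8]) -> Option<(bool, BlockType)> {
    let first = *stream.first()?;
    // Bit 0 is BFINAL; bits 1-2 are BTYPE (fields are packed LSB first).
    let is_final = first & 0b1 == 1;
    let block_type = match (first >> 1) & 0b11 {
        0b00 => BlockType::Stored,
        0b01 => BlockType::FixedHuffman,
        0b10 => BlockType::DynamicHuffman,
        _ => BlockType::Reserved,
    };
    Some((is_final, block_type))
}

/// Hypothetical helper: if the stream starts with a stored block, returns
/// that block's payload. The rest of the first byte is padding, followed
/// by LEN and NLEN (little-endian u16 each), so the payload starts at
/// offset 5.
pub fn stored_payload(stream: &[u8]) -> Option<&[u8]> {
    match first_block_header(stream)? {
        (_, BlockType::Stored) => {}
        _ => return None,
    }
    let len = u16::from_le_bytes([*stream.get(1)?, *stream.get(2)?]);
    let nlen = u16::from_le_bytes([*stream.get(3)?, *stream.get(4)?]);
    // NLEN must be the one's complement of LEN.
    if len != !nlen {
        return None;
    }
    stream.get(5..5 + len as usize)
}
```

Calling `first_block_header` on the output of `miniz_oxide::deflate::compress_to_vec` should report a stored block for the input in this issue, given that the test above passes. I have not run the helper against either backend.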

oyvindln commented 3 years ago

Yeah, it might be similar to what causes the differences in #77: the block selection algorithm is a bit too dumb. You could check by seeing whether you get the same result with the C miniz backend (or C miniz with the same settings).

oyvindln commented 3 years ago

Yeah, I looked at it a bit; it's due to the simpler block selection algorithm in miniz_oxide (and C miniz). I may change it to do a more thorough check like zlib does, though that requires a little restructuring.
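
To illustrate what comparing block types by cost can look like, here is a deliberately simplified sketch. It is not the code used by zlib, miniz, or miniz_oxide. It assumes the block starts on a byte boundary and that the fixed Huffman block contains only literals (no LZ77 matches, which a real encoder would also use), and it leaves out dynamic Huffman blocks entirely. The names `stored_block_bytes`, `fixed_literal_block_bytes`, `cheaper_block`, and `BlockChoice` are hypothetical. The code lengths and header sizes come from RFC 1951.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockChoice {
    Stored,
    FixedHuffman,
}

/// Code length in bits of a literal byte in the fixed Huffman table
/// (RFC 1951, section 3.2.6): 8 bits for 0..=143, 9 bits for 144..=255.
fn fixed_literal_bits(byte: u8) -> u64 {
    if byte <= 143 {
        8
    } else {
        9
    }
}

/// Size in bytes of `data` written as stored blocks, assuming the first
/// block starts on a byte boundary. Each stored block holds at most
/// 65535 bytes and costs 5 bytes of overhead (header byte, LEN, NLEN).
pub fn stored_block_bytes(data: &[u8]) -> u64 {
    let len = data.len() as u64;
    let blocks = ((len + 65_534) / 65_535).max(1);
    5 * blocks + len
}

/// Size in bytes of `data` written as one fixed Huffman block that
/// contains only literals: 3 header bits, one code per byte, and the
/// 7-bit end-of-block code, rounded up to whole bytes.
pub fn fixed_literal_block_bytes(data: &[u8]) -> u64 {
    let literal_bits: u64 = data.iter().map(|&b| fixed_literal_bits(b)).sum();
    let total_bits = 3 + literal_bits + 7;
    (total_bits + 7) / 8
}

/// Picks whichever of the two encodings is estimated to be smaller.
pub fn cheaper_block(data: &[u8]) -> BlockChoice {
    if fixed_literal_block_bytes(data) < stored_block_bytes(data) {
        BlockChoice::FixedHuffman
    } else {
        BlockChoice::Stored
    }
}
```

Under these assumptions, an ASCII-only input such as the JSON payload in this issue costs `len + 2` bytes as a literal-only fixed Huffman block and `len + 5` bytes as a stored block, so even this crude comparison would not pick the stored block. Getting below `len`, as zlib does here, additionally needs LZ77 matches or a dynamic Huffman table, which this sketch does not model.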