Compression has suboptimal encodings

aprescott / bentley_mcilroy

Bentley–McIlroy data compression algorithm.

MIT License

5 stars 2 forks source link

The following output is inefficient:

>> compress("abcaaaaaa", 1)
=> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

A more compact form would be

["abc", [0, 1], [3, 5]]

Contrast this to the sane output of:

>> compress("aaaa", 1)
=> ["a", [0, 3]]

In fact, ["abc", [0, 1], [3, 5]] is already a working valid compressed form:

>> BentleyMcIlroy::Codec.decompress(["abc", [0, 1], [3, 5]])
=> "abcaaaaaa"

The issue here is that we process "abc", see the next "a" and encode that into ["abc", [0, 1]], which is fine, but then the next "a" has no mapping into the now-encoded [0, 1] form of "a". When looking for the match, we need to somehow detect that "a" can work off [0, 1] or [3, 1]. Combined with extending matches forward, this would yield ["abc", [0, 1], [3, 5]] as desired.

Similarly:

"abcababa" == ["abc", [0, 2], [3, 3]]
"abcabab" == ["abc", [0, 2], [3, 2]]

// *** look_for_target_matches: // The VCDIFF format allows COPY instruction addresses to reference data from // the source (dictionary), or from previously encoded target data. // // If look_for_target_matches is false, then the encoder will only // produce COPY instructions that reference source data from the dictionary, // never from previously encoded target data. This will speed up the encoding // process, but the encoded data will not be as compact. // // If this value is true, then the encoder will produce COPY instructions // that reference either source data or target data. A COPY instruction from // the previously encoded target data may even extend into the range of the // data being produced by that same COPY instruction; for example, if the // previously encoded target data is "LA", then a single COPY instruction of // length 10 can produce the additional target data "LALALALALA". // // There is a third type of COPY instruction that starts within // the source data and extends from the end of the source data // into the beginning of the target data. This VCDIFF encoder will never // produce a COPY instruction of this third type (regardless of the value of // look_for_target_matches) because the cost of checking for matches // across the source-target boundary would not justify its benefits. //

aprescott / bentley_mcilroy

Compression has suboptimal encodings #1