aprescott / bentley_mcilroy

Bentley–McIlroy data compression algorithm.
MIT License
5 stars 2 forks source link

Compression has suboptimal encodings #1

Open aprescott opened 11 years ago

aprescott commented 11 years ago

The following output is inefficient:

>> compress("abcaaaaaa", 1)
=> ["abc", [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

A more compact form would be

["abc", [0, 1], [3, 5]]

Contrast this to the sane output of:

>> compress("aaaa", 1)
=> ["a", [0, 3]]

In fact, ["abc", [0, 1], [3, 5]] is already a working valid compressed form:

>> BentleyMcIlroy::Codec.decompress(["abc", [0, 1], [3, 5]])
=> "abcaaaaaa"

The issue here is that we process "abc", see the next "a" and encode that into ["abc", [0, 1]], which is fine, but then the next "a" has no mapping into the now-encoded [0, 1] form of "a". When looking for the match, we need to somehow detect that "a" can work off [0, 1] or [3, 1]. Combined with extending matches forward, this would yield ["abc", [0, 1], [3, 5]] as desired.

Similarly:

aprescott commented 11 years ago

Note that Google's open-vcdiff encoder, which also uses Bentley-McIlroy, doesn't seem to do this by default:

  // *** look_for_target_matches:
  // The VCDIFF format allows COPY instruction addresses to reference data from
  // the source (dictionary), or from previously encoded target data.
  //
  // If look_for_target_matches is false, then the encoder will only
  // produce COPY instructions that reference source data from the dictionary,
  // never from previously encoded target data.  This will speed up the encoding
  // process, but the encoded data will not be as compact.
  //
  // If this value is true, then the encoder will produce COPY instructions
  // that reference either source data or target data.  A COPY instruction from
  // the previously encoded target data may even extend into the range of the
  // data being produced by that same COPY instruction; for example, if the
  // previously encoded target data is "LA", then a single COPY instruction of
  // length 10 can produce the additional target data "LALALALALA".
  //
  // There is a third type of COPY instruction that starts within
  // the source data and extends from the end of the source data
  // into the beginning of the target data.  This VCDIFF encoder will never
  // produce a COPY instruction of this third type (regardless of the value of
  // look_for_target_matches) because the cost of checking for matches
  // across the source-target boundary would not justify its benefits.
  //

The default value of this flag is false.

(Note that this whole issue is actually independent of VCDIFF's VCD_TARGET window indicator bit, which concerns where S comes from in the universal string, not where COPY instructions can start and end.)