Open aprescott opened 11 years ago
Note that Google's open-vcdiff
encoder, which also uses Bentley-McIlroy, doesn't seem to do this by default:
// *** look_for_target_matches:
// The VCDIFF format allows COPY instruction addresses to reference data from
// the source (dictionary), or from previously encoded target data.
//
// If look_for_target_matches is false, then the encoder will only
// produce COPY instructions that reference source data from the dictionary,
// never from previously encoded target data. This will speed up the encoding
// process, but the encoded data will not be as compact.
//
// If this value is true, then the encoder will produce COPY instructions
// that reference either source data or target data. A COPY instruction from
// the previously encoded target data may even extend into the range of the
// data being produced by that same COPY instruction; for example, if the
// previously encoded target data is "LA", then a single COPY instruction of
// length 10 can produce the additional target data "LALALALALA".
//
// There is a third type of COPY instruction that starts within
// the source data and extends from the end of the source data
// into the beginning of the target data. This VCDIFF encoder will never
// produce a COPY instruction of this third type (regardless of the value of
// look_for_target_matches) because the cost of checking for matches
// across the source-target boundary would not justify its benefits.
//
The default value of this flag is false
.
(Note that this whole issue is actually independent of VCDIFF's VCD_TARGET
window indicator bit, which concerns where S
comes from in the universal string, not where COPY
instructions can start and end.)
The following output is inefficient:
A more compact form would be
Contrast this to the sane output of:
In fact,
["abc", [0, 1], [3, 5]]
is already a working valid compressed form:The issue here is that we process
"abc"
, see the next"a"
and encode that into["abc", [0, 1]]
, which is fine, but then the next"a"
has no mapping into the now-encoded[0, 1]
form of"a"
. When looking for the match, we need to somehow detect that"a"
can work off[0, 1]
or[3, 1]
. Combined with extending matches forward, this would yield["abc", [0, 1], [3, 5]]
as desired.Similarly:
"abcababa" == ["abc", [0, 2], [3, 3]]
"abcabab" == ["abc", [0, 2], [3, 2]]