ECToo / xdelta

Automatically exported from code.google.com/p/xdelta

Support VCD_TARGET compression mode ("source" is earlier target window) #35

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. I have a backup archive file with lots of redundant data (OS files, 
similar DB files, etc.) I want to use xdelta like rzip to find block 
matches within the same file. However, when I use xdelta without the -S 
parameter, it only uses 13 MB of RAM, and does not appear to find block 
matches within the same file. Gzip -9 outperforms xdelta significantly in 
this scenario. So I do not think xdelta is even trying to find block 
matches within the same file.
2. I would hope to achieve something like what rzip does. I did another contrived test where I took one ~400 MB file (file-a.dat), concatenated a smaller file (file-b.dat), and then concatenated the same ~400 MB file again. The resulting file-c.dat was not compressed much at all by xdelta when no source file was used. See attachment for more details (a sketch of this test appears under the additional-information section below).

What is the expected output? What do you see instead?
I expect a file which is much smaller than what I am getting, as redundant 
blocks from the common OS files and common database pages are eliminated.

What version of the product are you using? On what operating system?
3.0q on Windows XP SP2

Please provide any additional information below.
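For concreteness, here is a minimal sketch of the concatenation test from step 2. The file names follow the report; the exact xdelta3 invocation (plain encode with no -s source, xdelta3 on PATH) is an assumption for illustration.

    # Hypothetical reproduction of the step-2 test; file names from the report,
    # xdelta3 flags assumed (encode, no -s source).
    import shutil
    import subprocess

    # Build file-c.dat = file-a.dat + file-b.dat + file-a.dat
    with open("file-c.dat", "wb") as out:
        for name in ("file-a.dat", "file-b.dat", "file-a.dat"):
            with open(name, "rb") as part:
                shutil.copyfileobj(part, out)

    # Self-compress the target with no source file; the report observes that the
    # duplicated ~400 MB region is not detected, so the output stays large.
    subprocess.run(["xdelta3", "-e", "file-c.dat", "file-c.dat.vcdiff"], check=True)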

Original issue reported on code.google.com by malay...@gmail.com on 14 May 2007 at 3:14

Attachments:

GoogleCodeExporter commented 9 years ago
I also tried a test that increased the -P parameter:
xdelta3 -P 536870912 file-c.dat file-c.dat.2.vcdiff

This resulted in a file slightly larger than my first test.

Any ideas, or is this something that is not in the design of xdelta3? I would really like to compress the "long distance redundancy" out of a large TAR file. I figure the savings would be on the order of 10x greater compression than gzip alone for my use case.

Original comment by malay...@gmail.com on 15 May 2007 at 7:34

GoogleCodeExporter commented 9 years ago
I have enough requests that I think this needs to work. xdelta3 does not at present implement the VCD_TARGET compression mode, which is where the source window is a previous location in the target.
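
As background, here is a sketch of what VCD_TARGET means on the decoder side, based on the window-indicator semantics in RFC 3284 rather than on xdelta3's own code; the constants come from the RFC, the function is illustrative only.

    VCD_SOURCE = 0x01  # copy window taken from the source file
    VCD_TARGET = 0x02  # copy window taken from data already written to the target

    def copy_window(win_indicator, seg_len, seg_pos, source_file, target_so_far):
        # Return the bytes that COPY instructions in this window refer to.
        if win_indicator & VCD_SOURCE:
            source_file.seek(seg_pos)
            return source_file.read(seg_len)
        if win_indicator & VCD_TARGET:
            # The mode requested in this issue: reuse data produced earlier in
            # the target itself, e.g. the first copy of file-a.dat in file-c.dat.
            return target_so_far[seg_pos:seg_pos + seg_len]
        return b""  # no source segment for this window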

Original comment by josh.mac...@gmail.com on 14 Dec 2007 at 9:28


GoogleCodeExporter commented 9 years ago
Can the 64 MB offset window be adjusted? In the case of backup files, the self-similarity could be many GB apart.

Original comment by malay...@gmail.com on 14 Dec 2007 at 12:36

GoogleCodeExporter commented 9 years ago
The encoding format supports it.  I understand that's your request.

It will require some new functionality to be done well. The current checksum is not strong enough to be trusted to trigger disk seeks to offsets many gigabytes away.

Original comment by josh.mac...@gmail.com on 14 Dec 2007 at 6:17

GoogleCodeExporter commented 9 years ago
Can I ask if you're using Adler-32 as the weak checksum, like rsync? Or is it something else? Would a 64-bit analogue of Adler-32 even be useful in this scenario? Or would the added size make things drastically less efficient? While I am a programmer, my last real work in C was 10 years ago, so I am a bit lost in the source. I fear Python and C# have lowered my programming IQ quite a bit.

I'd eventually like to brew up some python that would enable deduplicated backups, much like rdiff-backup, but with a non-proprietary delta encoding format that actually remains the same between versions.

Speaking from experience with rsync, I find I have to use really large block sizes (2MB+) to prevent large file operations from being CPU bound. I am not sure how block size interacts with the hash size in the case of Xdelta.

Original comment by malay...@gmail.com on 14 Dec 2007 at 11:06

GoogleCodeExporter commented 9 years ago
I'm using something similar to adler32, but I may switch to a Rabin-Karp style checksum, which is a little faster. The issue is the block size: currently I use a 9-byte checksum for finding source copies. For finding target copies across gigabytes, where you will have to seek to read the data, a much longer checksum is needed. Like 2MB, as you suggest. So, new code is required.
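
For readers unfamiliar with the idea, here is a minimal Rabin-Karp-style rolling checksum sketch. The base, modulus, and window size are arbitrary illustrative choices, not what xdelta3 uses.

    BASE = 257
    MOD = 1 << 32          # keep 32 bits, like a small/weak checksum
    WINDOW = 16            # bytes per window (illustrative)
    POW = pow(BASE, WINDOW - 1, MOD)

    def roll_init(window_bytes):
        # Hash the first window from scratch.
        h = 0
        for b in window_bytes:
            h = (h * BASE + b) % MOD
        return h

    def roll_step(h, out_byte, in_byte):
        # Slide one byte: drop out_byte from the front, append in_byte.
        return ((h - out_byte * POW) * BASE + in_byte) % MOD

    # Sanity check: rolling gives the same value as hashing from scratch.
    data = bytes(range(64))
    h = roll_init(data[:WINDOW])
    h = roll_step(h, data[0], data[WINDOW])
    assert h == roll_init(data[1:WINDOW + 1])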

Original comment by josh.mac...@gmail.com on 14 Dec 2007 at 11:10

GoogleCodeExporter commented 9 years ago
I too need to find matches further apart than 32 MB. The large files I run through xdelta are Exchange EDB files; if, say, a user's mailbox gets deleted from the DB, then the matches could be beyond the 32 MB barrier. What is strange is that I am already comparing these files and have not had a failure. The command line I am using is

xdelta.exe -e -v -W 16777216 -1 -f -s "c:\oldfile.edb" "c:\newfile.edb" "c:\patchfile.dif"

and my files are up to 50 GB in size!

Original comment by a...@intralan.co.uk on 22 Dec 2007 at 5:49

GoogleCodeExporter commented 9 years ago
My understanding is that Xdelta will not *fail*; it will just not find any matches that are more than 64 MB apart, and therefore it will not compress as well as it should.

This feature request is actually even more complicated: the desire is not only to find matches more than 64 MB apart in offset, but also to find block matches within the target file as well as the source. That is, you could run xdelta without the -s flag at all and still see significant compression if your input file contained lots of redundant files or blocks. This is usually true of backup files, especially backup files that contain binaries from multiple systems.

Original comment by malay...@gmail.com on 23 Dec 2007 at 12:26

GoogleCodeExporter commented 9 years ago
The plan is to make xdelta find very-large matches at arbitrary offsets in the file, which will require a new strategy. Note that at present, xdelta uses a 9-13 byte checksum to find matches in the source file, which is a good balance of speed and accuracy for testing matches *in memory*. However, to identify matches further back in the source and/or target, we need far more accuracy, because the cost of a false positive is extremely high, since the disk has to seek.
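
To put the false-positive cost in perspective, here is a rough back-of-the-envelope sketch; the 32-bit checksum width is an assumption for illustration, and the ~50 GB file size comes from an earlier comment in this thread.

    # Expected spurious weak-checksum hits per lookup, assuming a uniformly
    # distributed c-bit checksum and one candidate at every byte offset
    # (both simplifications made for illustration).
    file_size = 50 * 10**9           # ~50 GB target, as mentioned earlier
    checksum_bits = 32               # assumed weak-checksum width
    expected_false_hits = file_size / 2**checksum_bits
    print(expected_false_hits)       # ~11.6 false hits per lookup

Even a handful of false hits per lookup is painful when confirming each one costs a disk seek, which is why a longer or stronger checksum is needed for far-away matches.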

Relatively high priority, given how many people are asking for this. The "merge" command and a few misc. features are probably more important.

Original comment by josh.mac...@gmail.com on 23 Dec 2007 at 7:27

GoogleCodeExporter commented 9 years ago
I would think that rsync and librsync should provide a good guide, with the whole "weak hash + strong hash" scheme. So you keep an adler32 checksum, as well as a stronger checksum (SHA-1, whatever). That's 24 bytes per hash entry. Or am I misunderstanding the need?

I do recognize that the hash table entries can get overwhelming on large files with such a scheme. Rsync's default "block size equals square root of file size" scheme is heavily CPU bound for files >4GB on my Core2 rig, so I think that should probably be addressed. I usually use 256 KB block sizes for files this large with rsync.
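
Here is a rough sketch of the weak-hash-plus-strong-hash block index described above, in the spirit of rsync rather than as a description of xdelta3's internals; the 2 MB block size echoes the earlier suggestion in this thread, and the hash choices (adler32 + SHA-1) are the ones named in the comment.

    import hashlib
    import zlib

    BLOCK = 2 * 1024 * 1024  # 2 MB blocks, as suggested earlier for multi-GB files

    def index_blocks(path):
        # Map weak checksum -> list of (offset, strong digest), one entry per block.
        table = {}
        offset = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                table.setdefault(zlib.adler32(block), []).append(
                    (offset, hashlib.sha1(block).digest()))
                offset += len(block)
        return table

    def find_match(table, candidate):
        # Cheap weak-hash lookup first; compute the strong hash only on a hit,
        # so the expensive confirmation (and any disk seek) stays rare.
        entries = table.get(zlib.adler32(candidate))
        if not entries:
            return None
        strong = hashlib.sha1(candidate).digest()
        for offset, digest in entries:
            if digest == strong:
                return offset
        return None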

Original comment by malay...@gmail.com on 28 Dec 2007 at 4:01

GoogleCodeExporter commented 9 years ago
Given that xdelta3 already computes smaller checksums, I have some ideas about more efficient approaches. BTW I used rsync as a guide for the earliest version of xdelta-1.x, so this is really a question of code organization in the VCDIFF setting. Thanks.

Original comment by josh.mac...@gmail.com on 28 Dec 2007 at 4:31