GoogleCodeExporter opened 9 years ago
I also tried a test that increased the -P parameter:
xdelta3 -P 536870912 file-c.dat file-c.dat.2.vcdiff
This resulted in a file slightly larger than my first test.
Any ideas, or is this something that is not in the design of xdelta3? I would really like to compress the "long distance redundancy" out of a large TAR file. I figure the savings would be on the order of 10x greater compression than gzip alone for my use case.
Original comment by malay...@gmail.com
on 15 May 2007 at 7:34
I have enough requests that I think this needs to work. xdelta3 does not at present implement the VCD_TARGET compression mode, which is where the source window is a previous location in the target.
Original comment by josh.mac...@gmail.com
on 14 Dec 2007 at 9:28
Can the 64 MB offset window be adjusted? In the case of backup files, the self-similarity could be many GB apart.
Original comment by malay...@gmail.com
on 14 Dec 2007 at 12:36
The encoding format supports it. I understand that's your request.
It will require some new functionality to be done well. The current checksum is not strong enough to justify wild disk seeks to offsets many gigabytes away.
Original comment by josh.mac...@gmail.com
on 14 Dec 2007 at 6:17
Can I ask if you're using Adler-32 as the weak checksum, like rsync? Or is it something else? Would a 64-bit analogue of Adler-32 even be useful in this scenario? Or would the added size make things drastically less efficient? While I am a programmer, my last real work in C was 10 years ago, so I am a bit lost in the source. I fear Python and C# have lowered my programming IQ quite a bit.
I'd eventually like to brew up some Python that would enable deduplicated backups, much like rdiff-backup, but with a non-proprietary delta encoding format that actually remains the same between versions.
Speaking from experience with rsync, I find I have to use really large block sizes (2 MB+) to prevent large file operations from being CPU bound. I am not sure how block size interacts with the hash size in the case of Xdelta.
Original comment by malay...@gmail.com
on 14 Dec 2007 at 11:06
I'm using something similar to adler32, but I may switch to a Rabin-Karp style checksum, which is a little faster. The issue is the block size: currently I use a checksum over a 9-byte window for finding source copies. For finding target copies across gigabytes, where you will have to seek to read the data, a checksum over a much longer window is needed, like the 2 MB you suggest. So, new code is required.
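For readers following along, here is a minimal Python sketch of a Rabin-Karp style rolling checksum over a fixed window, the kind of thing being discussed. The window size, base, and modulus are illustrative choices, not values taken from xdelta3's source:

WINDOW = 9           # small window, in the spirit of the 9-byte figure mentioned above
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime keeps the modular arithmetic cheap
POW_TOP = pow(BASE, WINDOW - 1, MOD)

def init_hash(buf, start):
    """Hash of buf[start:start + WINDOW], computed from scratch."""
    h = 0
    for b in buf[start:start + WINDOW]:
        h = (h * BASE + b) % MOD
    return h

def roll(h, out_byte, in_byte):
    """Slide the window one byte to the right in O(1): drop out_byte, append in_byte."""
    h = (h - out_byte * POW_TOP) % MOD
    return (h * BASE + in_byte) % MOD

Scanning a buffer is then one init_hash call followed by one roll per byte, which is what makes it affordable to probe a match table at every offset.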
Original comment by josh.mac...@gmail.com
on 14 Dec 2007 at 11:10
I too have the requirement to seek for matches further apart than 32 MB. My large files that I run through xdelta are Exchange EDB files; say a user's mailbox gets deleted from the db, then the matches could be over the 32 MB barrier. What is strange is that I am already comparing these files and have not had a failure. The command line I am using is
xdelta.exe -e -v -W 16777216 -1 -f -s "c:\oldfile.edb" "c:\newfile.edb" "c:\patchfile.dif"
and my files are up to 50 GB in size!
Original comment by a...@intralan.co.uk
on 22 Dec 2007 at 5:49
My understanding is that Xdelta will not *fail*, it will just not find any matches that are more than 64 MB apart. Therefore it will not compress as well as it should.
This feature request is actually even more complicated: the desire is not only to find matches more than 64 MB in offset, but also to find block matches from within the target file as well as the source. That is, you could run xdelta without the -s flag at all, and still see significant compression if your source file contained lots of redundant files or blocks. This is usually true of backup files, especially backup files that contain binaries from multiple systems.
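As a rough illustration of what matching "from within the target" could mean (a sketch only, using whole fixed-size blocks and a strong hash; this is not how xdelta3 works today, and the names are made up):

import hashlib

def find_self_matches(data, block=64 * 1024):
    """Yield (repeat_offset, first_offset) for identical fixed-size blocks
    seen earlier in the same file; a delta encoder could emit a copy from
    the earlier occurrence instead of emitting the data again."""
    seen = {}
    for off in range(0, len(data) - block + 1, block):
        digest = hashlib.sha1(data[off:off + block]).digest()
        if digest in seen:
            yield off, seen[digest]
        else:
            seen[digest] = off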
Original comment by malay...@gmail.com
on 23 Dec 2007 at 12:26
The plan is to make xdelta find very large matches at arbitrary offsets in the file, which will require a new strategy. Note that at present, xdelta uses a checksum over a 9-13 byte window to find matches in the source file, which is a good balance of speed and accuracy for testing matches *in memory*. However, to identify matches further back in the source and/or target, we need far more accuracy, because the cost of a false positive is extremely high, since the disk has to seek.
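To illustrate that cost argument (this is not xdelta3 code; the function name is made up): any candidate found through a checksum index at a far offset still has to be confirmed against the real bytes, so every false positive buys a wasted seek and read.

def confirm_match(f, candidate_offset, window_bytes):
    """Seek to a candidate offset in an open binary file and compare the
    actual data against window_bytes. Returns True only on a real match;
    the seek + read here is the expensive step a false positive wastes."""
    f.seek(candidate_offset)
    return f.read(len(window_bytes)) == window_bytes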
Relatively high priority, given how many people are asking for this. The "merge" command and a few misc. features are probably more important.
Original comment by josh.mac...@gmail.com
on 23 Dec 2007 at 7:27
I would think that rsync and librsync should provide a good guide, with the whole "weak hash + strong hash" scheme. So you keep an adler32 checksum, as well as a stronger checksum (SHA-1, whatever). That's 24 bytes per hash entry. Or am I misunderstanding the need?
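As an illustration of that rsync-style scheme (a sketch under the assumption of fixed-size blocks; the block size and function names are illustrative, not librsync's or xdelta3's API):

import hashlib
import zlib

BLOCK = 256 * 1024  # 256 KB blocks, the size suggested below for very large files

def build_index(data):
    """Map weak Adler-32 -> list of (offset, SHA-1 digest); 4 + 20 = 24 bytes per entry."""
    index = {}
    for off in range(0, len(data) - BLOCK + 1, BLOCK):
        block = data[off:off + BLOCK]
        index.setdefault(zlib.adler32(block), []).append(
            (off, hashlib.sha1(block).digest()))
    return index

def lookup(index, block):
    """Return the offset of an identical block, or None: the cheap weak hash
    filters candidates, the strong hash confirms them."""
    strong = hashlib.sha1(block).digest()
    for off, digest in index.get(zlib.adler32(block), []):
        if digest == strong:
            return off
    return None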
I do recognize that the hash table entries can get overwhelming on large files with such a scheme. Rsync's default "block size equals square root of file size" scheme is heavily CPU bound for files >4 GB on my Core2 rig, so I think that should probably be addressed. I usually use 256 KB block sizes for files this large with rsync.
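To make that scaling concrete, here is a back-of-envelope calculation under the square-root rule quoted above, using the 24-byte entry size from earlier in this comment (illustrative arithmetic only):

import math

file_size = 4 * 2**30                    # a 4 GB file
block_size = int(math.sqrt(file_size))   # 65536 bytes, i.e. 64 KB blocks
entries = file_size // block_size        # 65536 blocks to index
table_bytes = entries * 24               # ~1.5 MB of weak+strong hash entries
print(block_size, entries, table_bytes)

The table itself stays small; the expense comes from probing the rolling weak hash against it at every byte offset of the other file.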
Original comment by malay...@gmail.com
on 28 Dec 2007 at 4:01
Given that xdelta3 already computes smaller checksums, I have some ideas about more efficient approaches. BTW I used rsync as a guide for the earliest version of xdelta-1.x, so this is really a question of code organization in the VCDIFF setting. Thanks.
Original comment by josh.mac...@gmail.com
on 28 Dec 2007 at 4:31
Original issue reported on code.google.com by malay...@gmail.com on 14 May 2007 at 3:14