In Stream MD5 checksum

GoogleCodeExporter commented 9 years ago

To fully validate the patched file (xdelta -d), I use the MD5 hash of the
expected result, which I recorded from the original file when the patch was
encoded. Currently I decode the patch into a file and then MD5-checksum that
file to make sure it was created perfectly.

What would be nice is an option, -md5, that produces the MD5 hash of the file
as it is being built, so that the resulting file does not have to be rescanned
to compute the checksum.

To validate my files, everything is MD5-checksummed to work out whether files
have changed or are the correct versions.
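
As an illustration of the request, here is a minimal Python sketch of hashing data while it is written, so the output never has to be read back. It is not xdelta code; the function name write_with_md5 and the idea of receiving the output as an iterable of byte chunks are assumptions made for the example.

```python
import hashlib


def write_with_md5(chunks, out_path):
    """Write byte chunks to out_path and return the MD5 hex digest.

    Hypothetical helper: each chunk is hashed at the moment it is written,
    so the finished file is never re-read from disk.
    """
    digest = hashlib.md5()
    with open(out_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

The returned digest can then be compared with the MD5 recorded at encode time.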

Original issue reported on code.google.com by a...@intralan.co.uk on 22 Dec 2007 at 5:56

GoogleCodeExporter commented 9 years ago

xdelta -d -s sourcefile delta | tee outputfile | md5sum

Original comment by nicolas....@gmail.com on 26 Dec 2007 at 8:07

GoogleCodeExporter commented 9 years ago

A couple of points:

1. I am running xdelta as a process launched from a C# Windows service, and I
think piping won't work (I could be wrong).

2. Running the above command would mean that the resulting file gets read
after it has been written (again, I could be wrong); this would be extra I/O.

My suggestion is that the MD5 checksum is computed while the result file is
being created, meaning a single pass over the data (that is, the data being
streamed into the outputfile). This reduces disk I/O significantly.
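
For a caller that launches the decoder as a child process without a shell, the same effect as "tee" can be had by reading the child's standard output directly. The sketch below is in Python rather than C#, and run_and_hash is a hypothetical helper. It assumes the decoder writes the reconstructed file to standard output, as the command earlier in this thread implies.

```python
import hashlib
import subprocess


def run_and_hash(command, out_path, chunk_size=64 * 1024):
    """Run command, copy its stdout to out_path, return the MD5 hex digest.

    Hypothetical helper: no shell is involved, and the data is hashed as it
    streams past, so the output file is written once and never re-read.
    """
    digest = hashlib.md5()
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc, \
            open(out_path, "wb") as out:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            digest.update(chunk)
    # Leaving the "with" block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"command exited with status {proc.returncode}")
    return digest.hexdigest()


# Assumed usage, reusing the command line quoted earlier in the thread:
# run_and_hash(["xdelta", "-d", "-s", "sourcefile", "delta"], "outputfile")
```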

I will give the above a go to see what happens.

Thanks

Original comment by a...@intralan.co.uk on 27 Dec 2007 at 8:12

GoogleCodeExporter commented 9 years ago

It's not an unreasonable request, although it would be better if there were a
way for xdelta to automatically verify the MD5. The problem is there is no
currently-standardized method to embed the MD5sum at the end of the file
encoding.

xdelta3 does verify the adler32 checksum of each window. If you know the
length matches and all of the windows' adler32 checksums match, you can be
reasonably sure the file contents are correct. Is this sufficient?
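
The check described above (every window's adler32 matches and the total length matches) can be sketched in Python as follows. This is only an illustration of the idea, not xdelta3's actual implementation; verify_windows and its arguments are hypothetical.

```python
import zlib


def verify_windows(windows, expected_checksums, expected_length):
    """Return True if each window's Adler-32 matches and the length is right.

    Hypothetical helper: windows is a sequence of bytes objects and
    expected_checksums holds one unsigned 32-bit Adler-32 value per window.
    """
    windows = list(windows)
    if len(windows) != len(expected_checksums):
        return False
    total = 0
    for data, expected in zip(windows, expected_checksums):
        if zlib.adler32(data) != expected:
            return False
        total += len(data)
    return total == expected_length
```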

Original comment by josh.mac...@gmail.com on 27 Dec 2007 at 7:52

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

My suggestion comes from the fact that, if you are creating the resulting
outputfile, there is no disk I/O overhead in generating an MD5 checksum while
the file is being streamed to disk. I am guessing that most people who use
this type of patching verify that the output is indeed perfect. I must say I
have not yet found a single failure, but this does not mean it won't happen.

Personally, I log the original checksum and the checksum that would be created
if the file was patched; this double-checks that the process has worked
perfectly. I will look at moving over to adler32 for the checksums, as it
seems better than MD5 in terms of performance.

Maybe the other way to do this is to store the outputfile's checksum in the
vcdiff file, so that automatic checking could happen.

Original comment by a...@intralan.co.uk on 28 Dec 2007 at 6:50

GoogleCodeExporter commented 9 years ago

The "tee" solution does not involve extra disk I/O. That said, I agree with
you in principle.

The problem with your other suggestion, to store the outputfile's checksum in
the vcdiff file, is that vcdiff doesn't support such an annotation. In fact, I
had to petition the vcdiff designer to add adler32 support; MD5 is considered
very expensive.

For the encoder to add the MD5 checksum, it needs to be added at the end of
the vcdiff encoding. I will pass this idea around. (I think
application-specific per-window metadata is generally useful.)

As for the decoder outputting the MD5 checksum, it's reasonable, but I don't
think I can justify it unless the encoder is also storing the checksum at the
end of the encoding. I'll think about this support, but I want to remain part
of the VCDIFF standard and something needs to be added for this to work.

For now, I recommend the "tee" solution.

Original comment by josh.mac...@gmail.com on 28 Dec 2007 at 7:11

GoogleCodeExporter commented 9 years ago

I will look into the "tee" solution, thanks

Original comment by a...@intralan.co.uk on 28 Dec 2007 at 7:14

GoogleCodeExporter commented 9 years ago

I have done some testing with adler32; it is faster than MD5 and the checksum
is smaller.

909 MB file:

Adler32 took 11.8274956733328 seconds, checksum = 503813208
MD5 took 13.1282727273997 seconds, checksum = c469bb38bfd6937f1a868511b2d63ee4

So approximately a 10% speed improvement and a 66% saving in the size of the
checksum.

Is it possible to gather the adler32 checksum during xdelta's processing, on
either the encode or the decode, and output it to the console? From what I
read, if you feed the adler32 result of the previous window into the next
checksum calculation, you can produce a checksum for the whole file. Again, no
extra disk I/O.
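
The chaining described here can be sketched with Python's zlib.adler32, which accepts a running value as its second argument. The helper name running_adler32 is hypothetical. Note that this passes each window's data through a single running checksum; it does not combine per-window checksums that were each computed from scratch.

```python
import zlib


def running_adler32(chunks):
    """Return the Adler-32 of all chunks concatenated, computed incrementally."""
    value = 1  # Adler-32 starts from 1
    for chunk in chunks:
        # Feed the previous result in as the starting value for this chunk.
        value = zlib.adler32(chunk, value)
    return value
```

The result equals zlib.adler32 applied to the whole data in one call, so nothing has to be re-read to obtain it.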

Original comment by a...@intralan.co.uk on 28 Dec 2007 at 10:49

GoogleCodeExporter commented 9 years ago

Adler32 is a "weaker" checksum, the same one used by zlib.

However, xdelta is computing it for each window, not for the entire file.

To compute the entire-file checksum would double the cost, and at that point I
think it would be preferable to use MD5. As I mentioned, I would like to share
the idea with others interested in VCDIFF development to see if we can find a
solution, because I'd like to recover the xdelta-1.x feature of encoding the
MD5.

Original comment by josh.mac...@gmail.com on 28 Dec 2007 at 4:38

ECToo / xdelta

In Stream MD5 checksum #60