endlessm / xdelta3-dir-patcher

Tool for generating XDelta3 diff packages and applying them
GNU Lesser General Public License v2.1

Having some problems making this work... #27

Open i30817 opened 6 years ago

i30817 commented 6 years ago

https://gist.github.com/i30817/44c8dbc4f2e4e23e33956132cdee71d6

a.zip and b.zip are just the same file, duplicated to minimize testing factors.

I downloaded the master repository zip, extracted it, and opened a console. I took the master zip file, copied it, renamed the two copies to a.zip and b.zip, and then ran:

python3 ./xdelta3-dir-patcher diff a.zip b.zip patch

This worked.

and then:

python3 ./xdelta3-dir-patcher apply --ignore-euid a.zip patch out

This really didn't work (--ignore-euid is there so it doesn't depend on root).

XDELTA FAIL: 1
xdelta3: using default source filename: /tmp/XDelta3DirPatcher_delta_expanded12fvofkv/xdelta/xdelta3-dir-patcher-master/tests/test_dir_listing.py
xdelta3: source file too short: XD3_INVALID_INPUT
xdelta3: normally this indicates that the source file is incorrect
xdelta3: please verify the source file with sha1sum or equivalent

BTW, how does this utility react if the files are named differently and the contents differ, but a large part of the bytes is still the same (though not all)? That's the case I'm mostly interested in. Does it compress much worse because it assumes that one of the source files was deleted instead of transformed? Does it need the name sort to produce the same relative ordering, even when the names change, to be effective?

I wonder if it would be worth adding a user option, when creating the patch, to sort the source files by file size for the tar file.

i30817 commented 6 years ago

Here is the debug output of that patch application: https://gist.github.com/i30817/5c23e67ada58611638a345f1eb635e37

It ends up with just the folder structure in out and no files at all.

sgnn7 commented 6 years ago

@i30817 Some notes:

Any way you could post (or email me at <my username>@sgnn7.org) a.zip and b.zip for me to test with?

i30817 commented 6 years ago

It's just the download of the release zip file here (I did rename the README.md to a.md because I was curious how it would react, but when it blew up, I started using identical files for source and destination).

Files not being allowed to be renamed makes this not useful to me... I really expected the 'same name' part to be optional (i.e. either everything is collected into tar files with the same ordering and then the delta is created, or a first pass checks for files common to both source and destination and diffs those file by file, and a second pass collects the remaining files into tar files with the same ordering and calculates the delta between them; see the sketch below).
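Something like this is what I mean (a hypothetical sketch just to show the partitioning; none of these names exist in xdelta3-dir-patcher):

```python
# Hypothetical sketch of the two-pass scheme described above.
import os

def plan_passes(src_root, dst_root):
    src_names = set(os.listdir(src_root))  # flat listing for brevity
    dst_names = set(os.listdir(dst_root))

    per_file = src_names & dst_names          # pass 1: delta each common name by itself
    src_rest = sorted(src_names - dst_names)  # pass 2: tar these, in a fixed order...
    dst_rest = sorted(dst_names - src_names)  # ...and these, in the same fixed order,
                                              # then one delta between the two tars
    return per_file, src_rest, dst_rest
```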

Granted, I'm not sure how well the delta would react to 'minor' differences in the files. Imagine two CD images that were dumped differently, one with separate tracks in individual files and the other with the tracks concatenated into a single file, both with a .cue. The way I'd do it for those archives is to sort alphabetically, put all the files into a single tar, and only then do the xdelta, since the two archives have no files in common but the user must think they're at least related to attempt the delta.

i30817 commented 6 years ago

python3 --version
Python 3.5.2

if you're wondering.

sgnn7 commented 6 years ago

@i30817 If this is what you are doing, then you could use this tool, but you would need to wrap the content in a container format that xdelta can easily diff. You could add files alphabetically to a tar file and run the delta on that, and it would create the results that you want (I think, though I'm not 100% sure). This would handle most of the edge cases that you are concerned about, but you wouldn't get much in terms of permission/uid/gid handling or checkpointing.
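Roughly like this (an untested sketch; the file names are just examples, and it assumes the xdelta3 CLI is on your PATH):

```python
# Untested sketch of the container approach: tar both sides in the same
# (alphabetical) order, then diff the tars with the xdelta3 CLI.
import subprocess
import tarfile

def make_sorted_tar(tar_path, file_paths):
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(file_paths):  # fixed ordering on both sides
            tar.add(path, arcname=path)

make_sorted_tar("old.tar", ["a.cue", "track1.bin"])  # example inputs
make_sorted_tar("new.tar", ["b.cue", "merged.bin"])
subprocess.run(["xdelta3", "-e", "-s", "old.tar", "new.tar", "patch.vcdiff"],
               check=True)
```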

i30817 commented 6 years ago

I was actually interested in the zip support originally as a way to avoid the makework code of extracting from two archives (because of different compression methods and levels), sorting the files, feeding them into a tar archive, and then doing the xdelta myself. Are you sure I can't convince you to add a --block switch that treats the whole directory (zip or dir) as a single file for purposes of tar concatenation? It appears to me that 90% of it is already there*.

I'm actually also curious about how well xdelta compression performs in cases like this. Does it find a good longest common substring (of bytes) between two files even if the offsets are not exactly the same, or does its window not permit that?

And if xdelta is always lookahead, can we do all the steps (decompress two zips, create two tars, feed them to xdelta) as a stream, so that no byte except the final patch (and, in reverse, the extracted files) gets written to disk? No 'temporary' huge multi-gigabyte tar archives or zip extractions.

The tedious work is really tedious if you want to do this right, as I'm sure you know (stream programming so that as few bytes as possible get written to disk and wear down your write cycles); a sketch of what I mean follows.
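Here is roughly the streaming half of the idea, for the 'new' side at least (untested sketch; as far as I can tell, xdelta3 needs to seek in the -s source, so that side still has to be a real file):

```python
# Untested sketch: the "new" tar is piped straight into xdelta3's stdin and
# never hits the disk; the source tar stays a real file because xdelta3
# apparently needs random access to the -s argument.
import subprocess
import tarfile

def stream_diff(src_tar_path, new_files, patch_path):
    with open(patch_path, "wb") as patch:
        proc = subprocess.Popen(
            ["xdelta3", "-e", "-s", src_tar_path],  # target read from stdin,
            stdin=subprocess.PIPE, stdout=patch)    # patch written to stdout
        with tarfile.open(fileobj=proc.stdin, mode="w|") as tar:  # "w|" = stream mode
            for path in sorted(new_files):
                tar.add(path, arcname=path)
        proc.stdin.close()
        if proc.wait() != 0:
            raise RuntimeError("xdelta3 failed")
```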

*Speaking of that, switches to tweak the xdelta settings (the input and output memory window and the compression level) would be a good idea too.

sgnn7 commented 6 years ago

I'm actually also curious about how well xdelta compression performs in cases like this. Does it find a good longest common substring (of bytes) between two files even if the offsets are not exactly the same, or does its window not permit that?

xdelta3 does pretty well in those cases as far as I know (https://tools.ietf.org/html/rfc3284#page-22 under "Performance")

Are you sure I can't convince you to add a --block switch that treats the whole directory (zip or dir) as a single file for purposes of tar concatenation?

The only thing I can tell you is that it's really low on my priority list given the number of side projects I'm working on nowadays, but if you write down the exact behaviour that you would like to see with that switch, it would save a lot of effort for me (or whoever else tries to implement it).

i30817 commented 6 years ago

Ok, never mind, I'm getting to grips with the python bindings for xdelta3 right now. I even have something that works (I think), even if it's only compressing two files at a time right now: https://gist.github.com/i30817/06e5f18ac39d1d1c1765338cc5631139

I need to make the file array generator copy more than a single file's bytes and see how that falls out with the patching option.
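For reference, the core of what the gist does boils down to this round trip (using the xdelta3 bindings from PyPI, if I'm reading their API right; a.bin and b.bin are stand-ins for any two related blobs):

```python
# Minimal round trip with the xdelta3 python bindings (pip install xdelta3).
import xdelta3

with open("a.bin", "rb") as f:
    old = f.read()
with open("b.bin", "rb") as f:
    new = f.read()

delta = xdelta3.encode(old, new)           # patch that turns old into new
assert xdelta3.decode(old, delta) == new   # applying it recovers new exactly
```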

i30817 commented 6 years ago

Finished the utility (same link as above until I publish something on PyPI) and extended the patches to span more than one file on both zip streams. As for the original error, I still don't know what I did wrong there.