dennisvang / tufup

Automated updates for stand-alone Python applications.
MIT License

Really really slow (8+ hours) to generate patches for large files (450 mb) with 24GB RAM using bsdiff4. #154

Open mchaniotakis opened 5 days ago

mchaniotakis commented 5 days ago

First off, thanks a lot for contributing tufup. It is a great package and, as of now, the only reliable solution for an auto-updating framework. I really appreciate the effort put into building and maintaining it.

Describe the bug I generate two versions of my app that are exactly the same, the only difference being the version number. Following #69, I set os.environ["PYTHONSEED"] = "0" and os.environ["SOURCE_DATE_EPOCH"] = "1577883661" in the file where I call pyinstaller.run() and in the .spec file as well (although it's probably not needed in the spec file). I use bsdiff4 to generate patches between the two versions:

import gzip
import bsdiff4
# file_1 and file_2 are the paths to the two gzipped bundles
with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        patch = bsdiff4.diff(src_bytes=src_file.read(), dst_bytes=dst_file.read())

Looking at my RAM, it doesn't seem to fill up at any point. This patch generation has now been running for about 8-9 hours.

Using this package: detools I can test the following:

image

Provided that I could generate a patch with the detools library, it would be possible to manually do so after a publish, with skip_patch = True and infuse the patch later. However, the patches generated for these bundles are around 350MB to 450MB, which is suspicious and not practical. Here is some code to create patches using detools:

pip install detools

and

import gzip
from detools.create import create_patch
output_file = "../../../mypatch.patch"
with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        with open(output_file, "wb") as fpatch:
            create_patch(src_file, dst_file, fpatch, algorithm="match-blocks", patch_type="hdiffpatch", compression="none")

To Reproduce Steps to reproduce the behavior: I can provide the two builds I used (identical apart from the version number) from my open-source app. Feel free to use the code above to test patching with bsdiff4 and detools.

Expected behavior Using bsdiff4, the .diff() call never completes (the patch should be very small, hopefully less than 45 MB). Using detools, patch generation finishes within 2-10 minutes, but the patches are around 350 to 450 MB (the application bundle itself is 450 MB).

System info (please complete the following information):

Now, I understand that this is possibly a problem with the bsdiff implementation in bsdiff4. However, there is also a size limit on the files bsdiff can process (2 GB), while the hdiffpatch and match-blocks algorithms don't have that limit. I would appreciate any feedback on how I should go about debugging this.
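One way to start debugging is to time the diff on much smaller inputs and watch how the runtime grows. The following is only a minimal sketch: the helper names make_similar_pair and time_diff are made up for illustration and are not part of tufup or bsdiff4. Note also that random bytes may not trigger the slow case, so slices of the real bundles are likely a better test input.

```python
import random
import time
from typing import Callable, Tuple


def make_similar_pair(size: int, n_changes: int, seed: int = 0) -> Tuple[bytes, bytes]:
    """Return two byte strings of equal length that differ in at most n_changes positions.

    Hypothetical helper: synthetic data only, not representative of a real bundle.
    """
    rng = random.Random(seed)
    src = bytearray(rng.getrandbits(8) for _ in range(size))
    dst = bytearray(src)
    for _ in range(n_changes):
        pos = rng.randrange(size)
        dst[pos] = (dst[pos] + 1) % 256  # change the byte at this position
    return bytes(src), bytes(dst)


def time_diff(diff_func: Callable[[bytes, bytes], bytes], src: bytes, dst: bytes) -> Tuple[float, int]:
    """Run diff_func(src, dst) once and return (elapsed seconds, patch size in bytes)."""
    start = time.perf_counter()
    patch = diff_func(src, dst)
    return time.perf_counter() - start, len(patch)
```

With bsdiff4 installed, something like time_diff(bsdiff4.diff, src, dst) over increasing sizes (or over increasingly large slices of the two real tarballs) should show whether the runtime grows smoothly or blows up at a particular size or region.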

mchaniotakis commented 5 days ago

Please ignore my comment above about detools, as it's only for applying patches. Using HDiffPatch, I was able to create small patches with their binaries, and it's also super fast. I will have to test whether these diffs work with bsdiff patching (the HDiffPatch repo says it's supported). Either way, I do believe having the option to use HDiffPatch to handle large files would be a huge advantage.
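For anyone who wants to try the same thing from Python, here is a rough sketch of calling the HDiffPatch command-line binary via subprocess. It assumes the hdiffz executable is on the PATH and takes its arguments in the order old file, new file, output patch. The -BSD flag for bsdiff-compatible output is an assumption based on the HDiffPatch README, so please check the hdiffz help output for your version. The function names are hypothetical.

```python
import subprocess
from typing import List


def build_hdiffz_command(
    old_path: str,
    new_path: str,
    patch_path: str,
    bsdiff_compatible: bool = False,
    executable: str = "hdiffz",
) -> List[str]:
    """Build the argument list for an hdiffz call (hypothetical helper).

    Assumed usage: hdiffz [options] oldPath newPath outDiffFile
    """
    cmd = [executable]
    if bsdiff_compatible:
        # Assumption: -BSD requests a patch in bsdiff-compatible format.
        cmd.append("-BSD")
    cmd += [str(old_path), str(new_path), str(patch_path)]
    return cmd


def create_patch_with_hdiffz(old_path: str, new_path: str, patch_path: str, **kwargs) -> None:
    """Run hdiffz and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_hdiffz_command(old_path, new_path, patch_path, **kwargs), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to inspect the command before running it on a 450 MB bundle.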

dennisvang commented 5 days ago

@mchaniotakis Thanks for the highly detailed report. Much appreciated!

Although it is well known that bsdiff4 is slow and memory hungry, as noted in #105, I have a feeling the issue you describe, taking 8+ hours without finishing, for a very small change in code, is abnormal.

We recently ran into a similar problem with a ~100MB tarball, where a relatively large change in code required a patch creation time of approx. 20 minutes (i.e. "normal"), whereas a small change of a few characters resulted in patch creation never finishing (at least not within a few hours).

I've spent some time trying to track down the cause, and trying to reproduce this issue in a minimal example using only bsdiff4, but without much success.

I'll see if I can find the time to dive into this again, and compare different patching solutions.

A short-term alternative would be to provide a customization option, so users can supply their own patcher.
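To make the idea concrete, a pluggable patcher could look roughly like the sketch below. This is purely illustrative: the Patcher protocol, ZlibDictPatcher, and make_patch are hypothetical names, not an existing or planned tufup API. The toy implementation uses zlib with the old file as a preset dictionary, which only sees the last 32 KB of the source, so it is a stand-in for a real diff tool and not a replacement for bsdiff4.

```python
import zlib
from typing import Protocol


class Patcher(Protocol):
    """Hypothetical interface a user-supplied patcher would implement."""

    def create_patch(self, src: bytes, dst: bytes) -> bytes: ...

    def apply_patch(self, src: bytes, patch: bytes) -> bytes: ...


class ZlibDictPatcher:
    """Toy patcher: compress dst using src as a zlib preset dictionary."""

    def create_patch(self, src: bytes, dst: bytes) -> bytes:
        compressor = zlib.compressobj(level=9, zdict=src)
        return compressor.compress(dst) + compressor.flush()

    def apply_patch(self, src: bytes, patch: bytes) -> bytes:
        # The same dictionary (src) is needed to reconstruct dst.
        decompressor = zlib.decompressobj(zdict=src)
        return decompressor.decompress(patch) + decompressor.flush()


def make_patch(patcher: Patcher, src: bytes, dst: bytes) -> bytes:
    """Create a patch and verify that applying it to src reproduces dst."""
    patch = patcher.create_patch(src, dst)
    if patcher.apply_patch(src, patch) != dst:
        raise ValueError("patch does not reproduce the target")
    return patch
```

A wrapper around bsdiff4 or the HDiffPatch binaries could implement the same two methods, and the round-trip check in make_patch would catch a patcher that produces unusable patches before they are published.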

Note for newcomers: Before #105, we created patches from the .tar.gz, and I never saw this kind of behavior. However, as @mchaniotakis described in #69, the resulting patch files were basically useless, because they were nearly as large as the originals.