dennisvang / tufup

Automated updates for stand-alone Python applications.
MIT License

Really really slow (8+ hours) to generate patches for large files (450 MB) with 24 GB RAM using bsdiff4. #154

Open mchaniotakis opened 4 months ago

mchaniotakis commented 4 months ago

First off, thanks a lot for tufup: it is a great package and, as of now, the only reliable solution for an auto-updating framework. I really appreciate the effort put into building and maintaining it.

Describe the bug I generate 2 versions of my app, exactly the same except for the version number. Following #69, I set os.environ["PYTHONHASHSEED"] = "0" and os.environ["SOURCE_DATE_EPOCH"] = "1577883661" in the file where I run pyinstaller.run(), and in the .spec file as well (although it's probably not needed in the spec file). I then use bsdiff4 to generate patches between the 2 versions:
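For context, a minimal sketch of that build setup (the spec filename is a placeholder, and PyInstaller.__main__.run is assumed here as the entry point behind the pyinstaller.run() shorthand):

import os

# set before the build runs, so the PyInstaller output is reproducible (see #69)
os.environ["PYTHONHASHSEED"] = "0"
os.environ["SOURCE_DATE_EPOCH"] = "1577883661"

import PyInstaller.__main__

PyInstaller.__main__.run(["myapp.spec"])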

import gzip
import bsdiff4

# file_1 and file_2 are the .tar.gz bundles; diff the decompressed tar content
with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        bsdiff4.diff(src_bytes=src_file.read(), dst_bytes=dst_file.read())

Looking at my RAM, it doesn't seem to become full at any point. This patch generation has now been running for about 8-9 hours.

Using the detools package, I can test the following:

[screenshot: detools patch generation results]

Provided that I could generate a patch with the detools library, it would be possible to do so manually after a publish, using skip_patch = True and injecting the patch later. However, the patches generated for these bundles are around 350 MB to 450 MB, which is suspiciously large and not practical. Here is some code to create patches using detools:

pip install detools

and

import gzip

from detools.create import create_patch

output_file = "../../../mypatch.patch"
with gzip.open(file_1, mode='rb') as src_file:
    with gzip.open(file_2, mode='rb') as dst_file:
        with open(output_file, "wb") as fpatch:
            create_patch(src_file, dst_file, fpatch, algorithm="match-blocks", patch_type="hdiffpatch", compression="none")

To Reproduce Steps to reproduce the behavior: I can provide two copies of the exact same versions that I used from my open-sourced app. Feel free to use the code above to test patching with bsdiff4 and detools.

Expected behavior Using bsdiff4, the .diff() call never completes (the patch should be very small in size, hopefully less than 45 MB). Using detools, the patch generation finishes within 2-10 minutes, but the patches are around 350 to 450 MB (the application bundle itself is 450 MB).

System info (please complete the following information):

Now, I understand this is possibly a problem with the implementation of bsdiff in bsdiff4. However, there is a limit to the size of files bsdiff can process (2 GB), while the hdiffpatch and match-blocks algorithms don't have that limit. I would appreciate any feedback on how I should go about debugging this.

mchaniotakis commented 4 months ago

Please ignore my comment above about detools, as it's only for applying patches. Using HDiffPatch I was able to create small patches with their binaries, and it's also super fast. I will have to test whether these diffs work with bsdiff patching (the HDiffPatch repo says it's supported). Either way, I do believe having the option to use HDiffPatch to handle large files is a huge advantage.

dennisvang commented 4 months ago

@mchaniotakis Thanks for the highly detailed report. Much appreciated!

Although it is well known that bsdiff4 is slow and memory hungry, as noted in #105, I have a feeling that the issue you describe, taking 8+ hours without finishing for a very small change in code, is abnormal.

We recently ran into a similar problem with a ~100MB tarball, where a relatively large change in code required a patch creation time of approx. 20 minutes (i.e. "normal"), whereas a small change of a few characters resulted in patch creation never finishing (at least not within a few hours).

I've spent some time trying to track down the cause, and trying to reproduce this issue in a minimal example using only bsdiff4, but without much success.

I'll see if I can find the time to dive into this again, and compare different patching solutions.

A short-term alternative would be to provide a customization option, so users can provide their own patcher.
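As a rough illustration, such a hook might look something like the sketch below (this is hypothetical, not part of the current tufup API; the name default_patcher is made up here):

import bsdiff4

# hypothetical: a patcher would be any callable that takes source and
# destination bytes and returns the patch bytes; bsdiff4.diff stays the default
def default_patcher(src_bytes: bytes, dst_bytes: bytes) -> bytes:
    return bsdiff4.diff(src_bytes=src_bytes, dst_bytes=dst_bytes)

A user could then swap in an implementation that shells out to e.g. the hdiffz binary, as shown later in this thread.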

Note for newcomers: Before #105, we created patches from the .tar.gz, and I never saw this kind of behavior. However, as @mchaniotakis described in #69, the resulting patch files were basically useless, because they were nearly as large as the originals.

dennisvang commented 4 months ago

Possibly related: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409664

mchaniotakis commented 4 months ago

I was able to work around the bsdiff diff method by using the HDiffPatch binary provided in the repo I referenced above. The only part of the process that needs to change is patch creation, since HDiffPatch can save the patches in the same format as bsdiff. The process is adjusted as follows:

1) Create the bundle as you normally would, but skip the patch creation step.
2) Create the patch with HDiffPatch and compute the size and hash, just like in here.
3) Do repo.roles.add_or_update_target and repo.publish_changes.

The update process on the client should be the same. Here is a sample piece of code that I am using when creating the patch:

import gzip
import logging
import os
import subprocess
import tempfile

from tufup.repo import Patcher

logger = logging.getLogger(__name__)


def create_patch_with_hdiff(latest_archive, new_archive, output_file):
    """
    latest_archive: path to the .tar.gz archive of the older version
    new_archive: path to the .tar.gz archive of the newly created version
    output_file: path to the output location of the generated .patch file
    """
    hdiff_path = "location_of_the_binary/hdiffz.exe"
    with tempfile.TemporaryDirectory() as tmpdirname:
        latest_archive_extracted = os.path.join(tmpdirname, "latest.tar")
        new_archive_extracted = os.path.join(tmpdirname, "new.tar")
        # decompress the old archive to a plain .tar
        with gzip.open(latest_archive, "rb") as src_file:
            with open(latest_archive_extracted, "wb") as src_file_extracted:
                src_file_extracted.write(src_file.read())
        # decompress the new archive and compute size and hash of its tar content
        with gzip.open(new_archive, "rb") as dst_file:
            dst_tar_content = dst_file.read()
            dst_size_and_hash = Patcher._get_tar_size_and_hash(tar_content=dst_tar_content)
            with open(new_archive_extracted, "wb") as dst_file_extracted:
                dst_file_extracted.write(dst_tar_content)
            if os.path.exists(output_file):
                logger.info("Found old patch file, removing...")
                os.remove(output_file)
            # create patch (-BSD makes hdiffz write bsdiff-compatible output)
            subprocess.run(
                [hdiff_path, "-s-4k", "-BSD", latest_archive_extracted, new_archive_extracted, output_file],
                check=True,
            )
        return dst_size_and_hash
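For completeness, the wiring for step 3 might look roughly like this (a hypothetical sketch; whether add_or_update_target takes the size-and-hash dict as custom metadata is an assumption here, so check it against the tufup source):

# hypothetical glue code, assuming repo is a tufup Repository instance
# and the archive target was already added with skip_patch=True
dst_size_and_hash = create_patch_with_hdiff(latest_archive, new_archive, patch_path)
repo.roles.add_or_update_target(patch_path, custom=dst_size_and_hash)
repo.publish_changes(private_key_dirs)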
JessicaTegner commented 1 month ago

So @dennisvang, what are your thoughts on this? Should tufup consider switching libraries for computing patches?

dennisvang commented 1 month ago

Hi @JessicaTegner, sorry for my really slow response.

In the long run it may be good to switch to a different library, but for now I would prefer just to implement a customization option that allows users to specify their own patcher.

One of the reasons is the complexity of the update process when switching to a new technology. It's an interesting mix of challenges related to both backward compatibility and forward compatibility.

In addition, I would still like to know the underlying cause of this issue with bsdiff4. I've encountered the issue in production, but still have not been able to reproduce it consistently in a minimal example using bsdiff4 only.

JessicaTegner commented 1 month ago

Hey @dennisvang, good to see that you are still around. Yes, it's very interesting to observe this. I have been able to confirm the wildly different diff times for bsdiff4, and have tried different ways to fix it or get around it. A custom patcher option seems like a good first step; however, we should really either figure out if we can fix this upstream, or push for switching. Interestingly though: I never saw this behaviour with pyupdater.

Edit 1: I tried using "file_diff", compared to the normal "diff" method, and my RAM doesn't seem to be eaten up as much as with the "diff" method. I attribute that partly to the 2 files not being loaded into memory, but that still seems worth noting here.

dennisvang commented 1 month ago

> Interestingly though: I never saw this behaviour with pyupdater.

@JessicaTegner Me neither, as far as I can remember.

Then again, neither did I see this behavior with tufup before #105.

> Edit 1: I tried using "file_diff", compared to the normal "diff" method, and my RAM doesn't seem to be eaten up as much as with the "diff" method. I attribute that partly to the 2 files not being loaded into memory, but that still seems worth noting here.

May be interesting to give that a try. I do know that pyupdater also used file_diff.
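For reference, a minimal sketch of that file-based variant (paths are placeholders):

import bsdiff4

# diff two files on disk and write the patch file directly,
# instead of passing the full archives as bytes to bsdiff4.diff
bsdiff4.file_diff("app-1.0.tar", "app-2.0.tar", "app-1.0-to-2.0.patch")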

I'm not 100% sure, but I seem to recall that pyupdater created the patches based on compressed archives (.zip on Windows), instead of uncompressed ones...