mchaniotakis opened this issue 4 months ago
Please ignore my comment above about detools, as it's only for applying patches. Using HDiffPatch I was able to create small patches using their binaries, and it's also very fast. I will have to test whether these diffs work with bsdiff patching (the HDiffPatch repo says it's supported). Either way, I do believe having the option to use HDiffPatch to handle large files is a huge advantage.
@mchaniotakis Thanks for the highly detailed report. Much appreciated!
Although it is well known that bsdiff4 is slow and memory hungry, as noted in #105, I have a feeling the issue you describe, taking 8+ hours without finishing, for a very small change in code, is abnormal.
We recently ran into a similar problem with a ~100MB tarball, where a relatively large change in code required a patch creation time of approx. 20 minutes (i.e. "normal"), whereas a small change of a few characters resulted in patch creation never finishing (at least not within a few hours).
I've spent some time trying to track down the cause, and trying to reproduce this issue in a minimal example using only bsdiff4, but without much success.
I'll see if I can find the time to dive into this again, and compare different patching solutions.
A short term alternative would be to provide a customization option, so users can provide their own patcher.
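For the sake of discussion, such an option might look roughly like the sketch below. This is purely illustrative: the custom_patcher callback name and signature are made up and are not an existing tufup API.

```python
from pathlib import Path
from typing import Callable, Optional

import bsdiff4

# Illustrative only: a user-supplied callback that receives the old target,
# the new target, and the destination patch path.
CustomPatcher = Callable[[Path, Path, Path], None]


def create_patch(
    old_target: Path,
    new_target: Path,
    patch_path: Path,
    custom_patcher: Optional[CustomPatcher] = None,
) -> None:
    if custom_patcher is not None:
        # delegate patch creation to the user's patcher,
        # e.g. a thin wrapper around the hdiffz binary
        custom_patcher(old_target, new_target, patch_path)
    else:
        # default behavior: plain bsdiff4
        bsdiff4.file_diff(str(old_target), str(new_target), str(patch_path))
```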
Note for newcomers: Before #105, we created patches from the .tar.gz, and I never saw this kind of behavior. However, as @mchaniotakis described in #69, the resulting patch files were basically useless, because they were nearly as large as the originals.
Possibly related: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409664
I was able to work around the bsdiff diff method by using the HDiffPatch binary provided in the repo I referenced above. The only part of the process that needs to change is where the patch is created. HDiffPatch can save the patches in the same format as bsdiff. The process is adjusted as follows: 1) create the bundle as you normally would, but skip the patch creation step, 2) create the patch with HDiffPatch and compute its size and hash just like in here, and 3) call repo.roles.add_or_update_target and repo.publish_changes. The update process on the client should be the same. Here is a sample piece of code that I am using when creating the patch:
import gzip
import logging
import os
import subprocess
import tempfile

from tufup.repo import Patcher  # for _get_tar_size_and_hash (import path may differ per tufup version)

logger = logging.getLogger(__name__)


def create_patch_with_hdiff(latest_archive, new_archive, output_file):
    """
    latest_archive: path to the .tar.gz of the older version
    new_archive: path to the .tar.gz of the newly created version
    output_file: path to the output location of the generated .patch file
    """
    hdiff_path = "location_of_the_binary/hdiffz.exe"
    with tempfile.TemporaryDirectory() as tmpdirname:
        latest_archive_extracted = os.path.join(tmpdirname, "latest.tar")
        new_archive_extracted = os.path.join(tmpdirname, "new.tar")
        # decompress the old .tar.gz to a plain .tar
        with gzip.open(latest_archive, "rb") as src_file:
            with open(latest_archive_extracted, "wb") as src_file_extracted:
                src_file_extracted.write(src_file.read())
        # decompress the new .tar.gz and compute size/hash of its tar content
        with gzip.open(new_archive, "rb") as dst_file:
            dst_tar_content = dst_file.read()
            dst_size_and_hash = Patcher._get_tar_size_and_hash(tar_content=dst_tar_content)
            with open(new_archive_extracted, "wb") as dst_file_extracted:
                dst_file_extracted.write(dst_tar_content)
        if os.path.exists(output_file):
            logger.info("Found old patch file, removing...")
            os.remove(output_file)
        # create the patch (-BSD writes bsdiff-compatible output, -s-4k uses stream/block matching)
        process = subprocess.Popen(
            [hdiff_path, "-s-4k", "-BSD", latest_archive_extracted, new_archive_extracted, output_file]
        )
        process.wait()
    return dst_size_and_hash
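For context, the surrounding steps 1) to 3) look roughly like this, using the create_patch_with_hdiff function above. This is only a sketch: add_or_update_target and publish_changes are the methods mentioned earlier, but the skip_patch flag, the custom size/hash metadata and the other keyword names are assumptions that may differ between tufup versions.

```python
from pathlib import Path
from tufup.repo import Repository

repo = Repository.from_config()  # load the existing repo configuration

# 1) create the bundle/targets as usual, but skip the built-in patch creation
repo.add_bundle(new_bundle_dir=Path("dist/my_app"), skip_patch=True)

# 2) create the patch with HDiffPatch and get the size/hash of the new tar
patch_path = Path("repository/targets/my_app-2.0.patch")
dst_size_and_hash = create_patch_with_hdiff(
    latest_archive=Path("repository/targets/my_app-1.0.tar.gz"),
    new_archive=Path("repository/targets/my_app-2.0.tar.gz"),
    output_file=patch_path,
)

# 3) register the patch as a target (the custom metadata carries the tar
#    size/hash) and publish; exact keyword names may differ
repo.roles.add_or_update_target(local_path=patch_path, custom=dst_size_and_hash)
repo.publish_changes(private_key_dirs=[Path("keystore")])
```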
So @dennisvang, what are your thoughts on this? Should tufup consider switching libraries for computing patches?
Hi @JessicaTegner, sorry for my really slow response.
In the long run it may be good to switch to a different library, but for now I would prefer just to implement a customization option that allows users to specify their own patcher.
One of the reasons is the complexity of the update process when switching to a new technology. It's an interesting mix of challenges related to both backward compatibility and forward compatibility.
In addition, I would still like to know the underlying cause of this issue with bsdiff4. I've encountered the issue in production, but still have not been able to reproduce it consistently in a minimal example using bsdiff4 only.
Hey @dennisvang, good to see that you are still around. Yes, it's very interesting to observe this. I have been able to confirm the wildly different diff times for bsdiff4, and have tried different ways to fix it or get around it. A custom patcher option seems like a good first step, however we should really either figure out whether we can fix this upstream, or push for switching.

Interestingly though: never saw this behaviour with pyupdater.

Edit 1: I tried using the "file_diff" method, compared to the normal "diff" method, and my RAM doesn't seem to be eaten up as much as with the "diff" method. I attribute that partly to the two files not being loaded into memory, but that still seems worth noting here.
> interestingly though: Never saw this behaviour with pyupdater
@JessicaTegner Me neither, as far as I can remember.
Then again, neither did I see this behavior with tufup before #105.
> Edit 1: I tried using the "file_diff", compared to the normal "diff" method, and my ram doesn't seem to be eaten up as much as with the "diff" method. I attribute that partly to the 2 files not being loaded into memory, but that still seems worth noting here.
May be interesting to give that a try. I do know that pyupdater also used file_diff.
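For reference, the difference between the two bsdiff4 entry points boils down to something like this (paths are placeholders):

```python
import bsdiff4

# in-memory variant: both archives are read fully into RAM before diffing
with open("old.tar", "rb") as f_old, open("new.tar", "rb") as f_new:
    patch_bytes = bsdiff4.diff(f_old.read(), f_new.read())
with open("update.patch", "wb") as f_patch:
    f_patch.write(patch_bytes)

# file-based variant: bsdiff4 handles the file I/O itself, which, per the
# observation quoted above, appeared to keep peak memory usage lower
bsdiff4.file_diff("old.tar", "new.tar", "update.patch")
```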
I'm not 100% sure, but I seem to recall that pyupdater created the patches based on compressed archives (.zip on Windows), instead of uncompressed ones...
First off, thanks a lot for this contribution of tufup. It is a great package and the only reliable solution as of now for an auto-updating framework. I really appreciate the effort put into this and into maintaining it.
Describe the bug
I generate 2 versions of my app, exactly the same, with the only difference being the version number. Following #69, I set os.environ["PYTHONHASHSEED"] = "0" and os.environ["SOURCE_DATE_EPOCH"] = "1577883661" in the file from which I run pyinstaller.run(), and in the .spec file as well (although it's probably not needed in the spec file). I then use bsdiff4 to generate patches between the 2 versions:
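The patch generation step boils down to roughly the following (a minimal sketch; file names are placeholders, and the gzip handling approximates what tufup does with the uncompressed targets after #105):

```python
import gzip

import bsdiff4

# read the uncompressed tar content of both bundles
with gzip.open("my_app-1.0.tar.gz", "rb") as f:
    old_tar = f.read()
with gzip.open("my_app-2.0.tar.gz", "rb") as f:
    new_tar = f.read()

# this is the call that never completes for the (nearly identical) bundles
patch = bsdiff4.diff(old_tar, new_tar)
with open("my_app-1.0-to-2.0.patch", "wb") as f:
    f.write(patch)
```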
Looking at my RAM, it doesn't seem to become full at any point. This patch generation has now been running for about 8-9 hours.
Using this package, detools, I can test the following:
Provided that I could generate a patch with the detools library, it would be possible to manually do so after a publish, with skip_patch = True, and add the patch in afterwards. However, the patches generated for these bundles are around 350 MB to 450 MB, which is suspicious and not practical. Here is some code to create patches using detools:
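A sketch of what that looks like, based on my reading of the detools docs; the create_patch/apply_patch argument order is an assumption and may need adjusting:

```python
import detools

# create a patch between the two uncompressed tar bundles
with open("old.tar", "rb") as f_from, \
        open("new.tar", "rb") as f_to, \
        open("update.patch", "wb") as f_patch:
    detools.create_patch(f_from, f_to, f_patch)

# apply the patch again, to verify the new bundle can be reconstructed
with open("old.tar", "rb") as f_from, \
        open("update.patch", "rb") as f_patch, \
        open("new_reconstructed.tar", "wb") as f_to:
    detools.apply_patch(f_from, f_patch, f_to)
```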
To Reproduce
Steps to reproduce the behavior: I can provide two copies of the exact same versions that I used from my open-sourced app. Feel free to use the code above to test patching with bsdiff4 and detools.
Expected behavior
Using bsdiff4, the .diff() never completes (the resulting patch should be very small, hopefully less than 45 MB). Using detools, patch generation finishes within 2-10 minutes, but the patches are around 350 to 450 MB (the application bundle itself is 450 MB).
System info:
Now I understand that this is possibly a problem with the implementation of bsdiff in bsdiff4; however, there is a size limit on the files bsdiff can process (2 GB), while the hdiffpatch and match-blocks algorithms don't have that limit. I would appreciate any feedback on how I should go about debugging this.
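For reference, the two HDiffPatch matching modes I'm referring to can be invoked roughly like this (flag spellings as I understand them from the HDiffPatch README; paths and values are placeholders and may need adjusting):

```python
import subprocess

hdiffz = "path/to/hdiffz"  # hdiffz.exe on Windows

# stream/block-matching mode: low memory usage, no 2 GB limit;
# -BSD writes a bsdiff-compatible patch
subprocess.run(
    [hdiffz, "-s-64k", "-BSD", "old.tar", "new.tar", "update.patch"],
    check=True,
)

# in-memory match mode: usually smaller patches, but needs RAM roughly
# proportional to the input size
subprocess.run(
    [hdiffz, "-m-6", "-BSD", "old.tar", "new.tar", "update.patch"],
    check=True,
)
```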