Large patch sizes - Githubissues

mchaniotakis commented 1 year ago

Describe the bug

I have generated version 1 of a Python application bundled with PyInstaller. The package contains images, libraries, the .exe, and my .py files, which have been converted to .pyd binaries. One of those .pyd files states the version. If I change only that .pyd file, without running PyInstaller again, and then generate the second version of the bundle with tufup, I get a patch of 200 MB, which is crazy considering that the whole package is 340 MB. The last-modified dates of all files are the same, except for the .pyd file that states the version. Calling bsdiff4.file_diff() directly on the two versions produces the same result. I can provide both files if needed.

To Reproduce

Steps to reproduce the behavior:

Run cython and pyinstaller with .spec required to make the bundle.
Copy all files except 1 folder containing some images
Modify version.py and re-run Cython for that file to generate version.pyd, then copy over the one folder mentioned above from the source (the folder has not changed and is copied with shutil.copytree(), so the modification dates are the same)
Run repo.add_bundle(new_bundle_dir=bundle_dir); the resulting patch has the file size mentioned above

Expected behavior

A patch size of less than 10 MB. On a previous run, I regenerated just the .exe (running PyInstaller, copying only the .exe, deleting everything else, and otherwise following the steps above). The .exe is 17 MB, while the patch generated for that run was 35 MB.

System info (please complete the following information):

OS: Windows 11
Python version 3.9
Pyinstaller version 5.9.0
Tufup version 0.4.9
bsdiff4 version 1.2.3

dennisvang commented 1 year ago

@mchaniotakis Thanks for providing such a detailed report.

You are right, these excessively large patches for small changes are not very useful, to say the least.

Tufup was created as a replacement for PyUpdater (because PyUpdater is no longer maintained). For this reason, the patch creation in tufup using bsdiff4 is basically a naive copy of PyUpdater's make_patch (see inputs here).

Although I did add some tests for basic patch functionality, I must admit, I haven't paid very much attention to the resulting file sizes.

The use of bsdiff4, in itself, does not seem to be a problem. Rather, the problem comes from the fact that we use it, naively, to create binary differences of .tar.gz archives.

It appears that binary diffs of either uncompressed .tar files or non-tar .gz files are okay, but binary diffs of .tar.gz files are troublesome (the diffs are correct, but very large).
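One plausible contributing factor, offered as an illustration rather than a confirmed diagnosis: compressed output depends on everything that precedes it in the stream, so a tiny change in the input can alter many of the compressed bytes that follow, leaving a binary diff tool little to match. The sketch below uses only the standard library, with `zlib` standing in for a real .tar.gz, and all names and data are made up for the example.

```python
import random
import zlib


def count_changed_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Deterministic, compressible stand-in for archive content (hypothetical data).
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
body = b" ".join(rng.choice(words) for _ in range(20_000))

v1 = b"version 1.0\n" + body
v2 = b"version 2.0\n" + body  # exactly one byte differs

changed_raw = count_changed_bytes(v1, v2)
changed_compressed = count_changed_bytes(zlib.compress(v1), zlib.compress(v2))

print(f"uncompressed: {changed_raw} of {len(v1)} bytes changed")
print(f"compressed:   {changed_compressed} bytes changed")
```

Comparing the two printed counts gives a rough feel for how much of the compressed stream is disturbed by a one-byte edit; the exact numbers depend on the zlib build.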

There's probably a good explanation for this, so I'll have a closer look at it as soon as I have some free time.

dennisvang commented 1 year ago

As a temporary workaround, patches can be disabled using --skip-patches, see PR #68.

On the command line:

tufup targets add --skip-patches <app_version> <bundle_dir> <key_dirs>

or in a script:

...
repo = Repository.from_config()
repo.add_bundle(new_bundle_dir=..., new_version=..., skip_patch=True)
repo.publish_changes(private_key_dirs=...)
...

dennisvang commented 9 months ago

Another problem may be the fact that pyinstaller builds are not reproducible by default, as explained in the docs:

In certain cases it is important that when you build the same application twice, using exactly the same set of dependencies, the two bundles should be exactly, bit-for-bit identical.

That is not the case normally. Python uses a random hash to make dicts and other hashed types, and this affects compiled byte-code as well as PyInstaller internal data structures. As a result, two builds may not produce bit-for-bit identical results even when all the components of the application bundle are the same and the two applications execute in identical ways.

but

You can ensure that a build will produce the same bits by setting the PYTHONHASHSEED environment variable to a known integer value before running PyInstaller. [...]

in addition

Changed in version 4.8: The build timestamp in the PE headers of the generated Windows executables is set to the current time during the assembly process. A custom timestamp value can be specified via the SOURCE_DATE_EPOCH environment variable to achieve reproducible builds.
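As a rough sketch, the two environment variables quoted above could be set from a build script like this. The helper names, the seed, the timestamp, and the spec file name are all arbitrary choices for illustration; only the variable names come from the PyInstaller docs.

```python
import os
import subprocess
import sys


def reproducible_build_env(hash_seed: int = 1, source_date_epoch: int = 1700000000) -> dict:
    """Return a copy of the current environment with both variables fixed.

    The default values are arbitrary; what matters is that they stay the
    same from one build to the next.
    """
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(hash_seed)
    env["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    return env


def build(spec_file: str) -> None:
    """Run PyInstaller on a spec file using the fixed environment."""
    subprocess.run(
        [sys.executable, "-m", "PyInstaller", spec_file],
        env=reproducible_build_env(),
        check=True,
    )


# Usage (hypothetical spec file name):
# build("main.spec")
```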

I'll have to do some more tests...

UPDATE:

Hmm... Does not seem to make much of a difference in the tufup-example app. Setting both PYTHONHASHSEED and SOURCE_DATE_EPOCH produces patches that still vary in size between runs, and are still far too big for the small change (only 1.0 changed to 2.0):

archive size v1/v2: 10846 KB
patch size (default): 7064 KB, 6912 KB, ...
patch size ("reproducible"): 6659 KB, 6975 KB, ...

dennisvang commented 9 months ago

more useful information:

dennisvang commented 9 months ago

Although we can now work around most of the issues with reproducibility with gzip (see #93), one risk remains:

The compressed output from gzip depends on the implementation, and there is no guarantee that identical input will lead to identical output between different implementations. (only equality of decompressed output is guaranteed)

We assume that the tufup archives are created on the same OS that they are used on, and that the gzip implementation is sufficiently stable between versions of the same OS to guarantee byte-for-byte equality. However, this may lead to trouble in the future: if gzip output turns out to be unstable between different versions of the same OS, the python-tuf hash check would fail, preventing updates.
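To make the distinction concrete, here is a small sketch using Python's standard `gzip` module (Python 3.8+ for the `mtime` argument). It is not necessarily what #93 does; it only shows that header metadata such as the timestamp affects the compressed bytes, while the decompressed content stays the same.

```python
import gzip

payload = b"stand-in for the bytes of a .tar archive" * 1000

# With a fixed mtime, repeated calls in the same interpreter give identical bytes.
first = gzip.compress(payload, mtime=0)
second = gzip.compress(payload, mtime=0)
same_bytes = first == second

# A different header timestamp changes the file, and therefore its hash,
# even though the decompressed content is identical.
other = gzip.compress(payload, mtime=1)
differs = other != first
same_content = gzip.decompress(other) == gzip.decompress(first) == payload
```

Note that this only demonstrates stability within one gzip implementation; it says nothing about equality across implementations, which is the remaining risk described above.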

There are a few options to prevent this:

Implement support for OS versions in the archive filename, so we can add separate targets for different OSes (or OS versions). This is also in line with multi-platform support as in #79.
Register the .tar archives as targets, instead of .tar.gz archives. This would simplify our code, because we would no longer need to worry about gzip reproducibility. To save disk space on the client we could still keep a compressed archive there using the default gzip. Note that gzip compression could still be used for file transmission, e.g. using the Content-Encoding: gzip HTTP-header, but this would depend on the user's update-server configuration and would therefore be outside the scope of tufup (python-tuf automatically handles decompression if that HTTP-header is set).

dennisvang commented 7 months ago

After some more thought, here's another option:

We stick with compressed archives (.tar.gz) as our tuf repository targets.

This means the download verification process and the server configuration can remain unaltered.

However:

we take precautions to ensure that our (uncompressed) .tar archives are reproducible
we create a (monolithic) patch file using bsdiff4 from the (uncompressed) .tar archives
we include a file hash for the (uncompressed) destination .tar archive in the target metadata for the patch file, using a CUSTOM object (see #100)
after reconstructing the destination archive from the patch, on the client side, we verify its integrity using the hash from the custom metadata object, before gzipping the archive (just to save storage space)
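The first and third steps above could look roughly like the following, using only the standard library. This is a sketch under assumptions, not tufup's implementation: the function names are hypothetical, and the choice of which metadata to normalize (timestamps, owner ids and names) is a guess at what "precautions" might involve.

```python
import hashlib
import io
import os
import pathlib
import tarfile
import tempfile


def _normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Strip metadata that can vary between machines or build runs."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def make_reproducible_tar(bundle_dir: pathlib.Path) -> bytes:
    """Pack bundle_dir into an uncompressed tar with stable order and metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for path in sorted(bundle_dir.rglob("*")):  # sorted for a fixed member order
            arcname = path.relative_to(bundle_dir).as_posix()
            tar.add(path, arcname=arcname, recursive=False, filter=_normalize)
    return buffer.getvalue()


def sha256_hex(data: bytes) -> str:
    """Hash of the uncompressed archive, e.g. for storing alongside the patch."""
    return hashlib.sha256(data).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    bundle = pathlib.Path(tmp)
    (bundle / "version.txt").write_text("1.0")
    (bundle / "images").mkdir()
    (bundle / "images" / "logo.bin").write_bytes(b"\x00" * 64)
    tar_a = make_reproducible_tar(bundle)
    os.utime(bundle / "version.txt", (1, 1))  # new timestamp, same content
    tar_b = make_reproducible_tar(bundle)

same_hash = sha256_hex(tar_a) == sha256_hex(tar_b)
```

The patch itself would then be created from two such uncompressed archives with bsdiff4, which is left out here to keep the sketch free of third-party dependencies.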

The only problem remaining now is that our uncompressed .tar archives can be two or three times the size of the corresponding .tar.gz files. This may cause trouble due to resource limitations, as bsdiff4 requires a lot of memory (and time).

In addition, we should implement some kind of failsafe, so that failed patches will be ignored on the next run, in favor of a full installation. (done: #101)

Why go to the trouble of verifying the integrity of the reconstructed archive?

The integrity and authenticity of the patch and the current archive are already guaranteed by TUF.

Knowing this, it seems highly unlikely that anything could go wrong when applying the patch.

Nevertheless, if anything does go wrong, our self-updating application is likely to be broken. This would require a manual re-install.

Moreover, it is quite possible that a mistake somewhere in the workflow would lead to a patch being applied to the wrong archive: bsdiff4 will happily apply a patch to any src file, regardless of whether the patch was actually created from that file. Obviously, the result would be unusable.

To illustrate the point:

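import bsdiff4

original = b'this represents the original file'
updated = b'this represents the updated file'
wrong = b'this is the wrong file'

# create a patch that turns `original` into `updated`
patch = bsdiff4.diff(src_bytes=original, dst_bytes=updated)
# applying the patch to the correct source reproduces `updated`
reconstructed = bsdiff4.patch(src_bytes=original, patch_bytes=patch)
assert reconstructed == updated
# applying the same patch to the wrong source raises no error,
# but the result is not the file we wanted
broken = bsdiff4.patch(src_bytes=wrong, patch_bytes=patch)
assert broken != updated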
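A hash check on the reconstructed bytes would catch the `broken` case above. The following is a minimal sketch, assuming the expected SHA-256 digest of the uncompressed destination archive is available from trusted metadata; the function name and the choice of `ValueError` are hypothetical, and the bsdiff4 step is replaced by literal bytes so the example runs with the standard library alone.

```python
import gzip
import hashlib


def verify_and_compress(reconstructed_tar: bytes, expected_sha256: str) -> bytes:
    """Check the patched archive against a trusted hash, then gzip it for storage.

    Raises ValueError on mismatch, so the caller can fall back to a full download.
    """
    actual = hashlib.sha256(reconstructed_tar).hexdigest()
    if actual != expected_sha256:
        raise ValueError("reconstructed archive does not match the expected hash")
    return gzip.compress(reconstructed_tar)


updated = b'this represents the updated file'
broken = b'this is not the file we wanted'  # stand-in for a bad patch result
expected = hashlib.sha256(updated).hexdigest()

stored = verify_and_compress(updated, expected)  # hash matches, returns gzip bytes

try:
    verify_and_compress(broken, expected)
    rejected = False
except ValueError:
    rejected = True
```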

dennisvang / tufup

Large patch sizes #69
