Closed mchaniotakis closed 6 months ago
@mchaniotakis Thanks for providing such a detailed report.
You are right, these excessively large patches for small changes are not very useful, to say the least.
Tufup was created as a replacement for PyUpdater (because PyUpdater is no longer maintained). For this reason, the patch creation in tufup using bsdiff4 is basically a naive copy of PyUpdater's make_patch (see inputs here).
Although I did add some tests for basic patch functionality, I must admit, I haven't paid very much attention to the resulting file sizes.
The use of bsdiff4, in itself, does not seem to be a problem. Rather, the problem comes from the fact that we use it, naively, to create binary differences of .tar.gz archives.
It appears that binary diffs of either uncompressed .tar files or non-tar .gz files are okay, but binary diffs of .tar.gz files are troublesome (the diffs are correct, but very large).
There's probably a good explanation for this, so I'll have a closer look at it as soon as I have some free time.
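One plausible contributing factor (an assumption, not a confirmed diagnosis) is that compression spreads a small change in the input over a large part of the compressed stream, leaving a binary diff with little to match. A minimal sketch using only the standard library, with made-up stand-in data instead of real archives:

```python
import random
import zlib


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


# Deterministic, compressible stand-in for archive contents (illustration only).
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
body = b" ".join(rng.choice(words) for _ in range(20_000))

old = b"version = 1.0\n" + body
new = b"version = 2.0\n" + body

# Compare the raw payloads, then compare their compressed forms.
raw_diff = count_differing_bytes(old, new)
compressed_diff = count_differing_bytes(zlib.compress(old), zlib.compress(new))

print(f"uncompressed: {raw_diff} of {len(old)} bytes differ")
print(f"compressed:   {compressed_diff} bytes differ")
```

The exact numbers depend on the data and on the zlib build, so treat the printed values as indicative only.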
As a temporary workaround, patches can be disabled using --skip-patches, see PR #68.
On the command line:
tufup targets add --skip-patches <app_version> <bundle_dir> <key_dirs>
or in a script:
...
repo = Repository.from_config()
repo.add_bundle(new_bundle_dir=..., new_version=..., skip_patch=True)
repo.publish_changes(private_key_dirs=...)
...
Another problem may be the fact that pyinstaller builds are not reproducible by default, as explained in the docs:
In certain cases it is important that when you build the same application twice, using exactly the same set of dependencies, the two bundles should be exactly, bit-for-bit identical.
That is not the case normally. Python uses a random hash to make dicts and other hashed types, and this affects compiled byte-code as well as PyInstaller internal data structures. As a result, two builds may not produce bit-for-bit identical results even when all the components of the application bundle are the same and the two applications execute in identical ways.
but
You can ensure that a build will produce the same bits by setting the PYTHONHASHSEED environment variable to a known integer value before running PyInstaller. [...]
in addition
Changed in version 4.8: The build timestamp in the PE headers of the generated Windows executables is set to the current time during the assembly process. A custom timestamp value can be specified via the SOURCE_DATE_EPOCH environment variable to achieve reproducible builds.
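A minimal sketch of how both variables could be set from a build script before calling PyInstaller. The helper names, the default values, and the spec file argument are hypothetical, chosen only for illustration:

```python
import os
import subprocess
import sys


def reproducible_build_env(hash_seed: int = 1, source_date_epoch: int = 1_600_000_000) -> dict:
    """Return a copy of the current environment with both variables fixed."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(hash_seed)
    env["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    return env


def run_pyinstaller(spec_file: str) -> None:
    """Run PyInstaller on a (hypothetical) spec file using the fixed environment."""
    subprocess.run(
        [sys.executable, "-m", "PyInstaller", spec_file],
        env=reproducible_build_env(),
        check=True,
    )
```

Something like run_pyinstaller("myapp.spec") would then be called once per build, using the same values each time.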
I'll have to do some more tests...
UPDATE:
Hmm... Does not seem to make much of a difference in the tufup-example app. Setting both PYTHONHASHSEED and SOURCE_DATE_EPOCH produces patches that still vary in size between runs, and are still far too big for the small change (only 1.0 changed to 2.0).
more useful information:
Although we can now work around most of the issues with reproducibility with gzip (see #93), one risk remains:
The compressed output from gzip depends on the implementation, and there is no guarantee that identical input will lead to identical output between different implementations (only equality of the decompressed output is guaranteed).
We assume that the tufup archives are created on the same OS that they are used on, and that the gzip implementation is sufficiently stable between versions of the same OS to guarantee byte-for-byte equality. However, this may lead to trouble in the future: if it turns out that gzip output is unstable between different versions of the same OS, the python-tuf hash check would fail, preventing updates.
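For reference, a minimal sketch of reproducible gzip output within a single implementation, using Python's standard gzip module. This is not necessarily what #93 does; the helper name is made up. It only shows that fixing the header timestamp removes one source of variation, and it says nothing about other gzip implementations:

```python
import gzip


def compress_reproducibly(data: bytes) -> bytes:
    """Compress with a fixed header timestamp, so repeated runs give equal bytes."""
    # mtime=0 keeps the current time out of the gzip header.
    return gzip.compress(data, compresslevel=9, mtime=0)


payload = b"example archive contents " * 1000
first = compress_reproducibly(payload)
second = compress_reproducibly(payload)

print(first == second)  # same implementation, same input, same settings
```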
There are a few options to prevent this:
Use .tar archives as targets, instead of .tar.gz archives. This would simplify our code, because we would no longer need to worry about gzip reproducibility. To save disk space on the client we could still keep a compressed archive there using the default gzip. Note that gzip compression could still be used for file transmission, e.g. using the Content-Encoding: gzip HTTP header, but this would depend on the user's update-server configuration and would therefore be outside the scope of tufup (python-tuf automatically handles decompression if that HTTP header is set).
After some more thought, here's another option:
We stick with compressed archives (.tar.gz) as our tuf repository targets.
This means the download verification process and the server configuration can remain unaltered.
However:
- we make sure the .tar archives are reproducible
- we create patches with bsdiff4 from the (uncompressed) .tar archives
- we include the hash of the .tar archive in the target metadata for the patch file, using a CUSTOM object (see #100)
The only problem remaining now is that our uncompressed .tar archives can be two or three times the size of the corresponding .tar.gz files. This may cause trouble due to resource limitations, as bsdiff4 requires a lot of memory (and time).
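As a rough illustration of what a reproducible .tar archive involves, here is a minimal sketch with the standard tarfile module. It is not tufup's actual implementation; the function name and the in-memory file mapping are hypothetical. Members are added in sorted order with fixed metadata, so the same contents always give the same bytes:

```python
import io
import tarfile


def make_reproducible_tar(files: dict, mtime: int = 0) -> bytes:
    """Build a tar archive in memory from a {name: content_bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in sorted(files):  # fixed member order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime  # fixed timestamp
            info.mode = 0o644  # fixed permissions
            info.uid = info.gid = 0  # no machine-specific owner ids
            info.uname = info.gname = ""  # no machine-specific owner names
            tar.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


archive_a = make_reproducible_tar({"b.txt": b"two", "a.txt": b"one"})
archive_b = make_reproducible_tar({"a.txt": b"one", "b.txt": b"two"})
print(archive_a == archive_b)
```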
In addition, we should implement some kind of failsafe, so that failed patches will be ignored on the next run, in favor of a full installation. (done: #101)
The integrity and authenticity of the patch and the current archive are already guaranteed by TUF.
Knowing this, it seems highly unlikely that anything could go wrong when applying the patch.
Nevertheless, if anything does go wrong, our self-updating application is likely to be broken. This would require a manual re-install.
Moreover, it is quite possible that a mistake somewhere in the workflow would lead to a patch being applied to the wrong archive: bsdiff4 will happily apply a patch to any src file, regardless of whether the patch was actually created from that file. Obviously, the result would be unusable.
To illustrate the point:
import bsdiff4
original = b'this represents the original file'
updated = b'this represents the updated file'
wrong = b'this is the wrong file'
patch = bsdiff4.diff(src_bytes=original, dst_bytes=updated)
reconstructed = bsdiff4.patch(src_bytes=original, patch_bytes=patch)
assert reconstructed == updated
broken = bsdiff4.patch(src_bytes=wrong, patch_bytes=patch)
assert broken != updated
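One way to guard against this, sketched under assumptions: compare a hash of the patched result against an expected value, and signal that a full update is needed if anything goes wrong. The function names are hypothetical and this is not tufup's actual code. The patch function is passed in as a parameter (bsdiff4.patch could be used there), and a trivial stand-in keeps the example self-contained:

```python
import hashlib
from typing import Callable, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def apply_patch_checked(
    src: bytes,
    patch: bytes,
    expected_dst_hash: str,
    patch_func: Callable[[bytes, bytes], bytes],
) -> Optional[bytes]:
    """Return the patched bytes, or None to signal that a full update is needed."""
    try:
        dst = patch_func(src, patch)
    except Exception:
        return None  # the patch could not be applied at all
    if sha256_hex(dst) != expected_dst_hash:
        return None  # wrong source archive or corrupted result
    return dst


def fake_patch(src: bytes, patch: bytes) -> bytes:
    """Stand-in for a real patch function: simply appends the patch bytes."""
    return src + patch


expected = sha256_hex(b"original+update")
good = apply_patch_checked(b"original", b"+update", expected, fake_patch)
bad = apply_patch_checked(b"wrong", b"+update", expected, fake_patch)
print(good is not None, bad is None)
```

Returning None instead of raising leaves the fallback decision (a full download) to the caller.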
Describe the bug I have generated version 1 of a Python application bundled with PyInstaller. This package contains images, libraries, the .exe, and my .py files, which have been converted to .pyd (binaries). One of those .pyd files states the version of the file. If I only change the version in that .pyd file, without running PyInstaller again, and then generate the second version of the bundle with tufup, I get a file difference of 200 MB, which is crazy if you take into account that the whole package is 340 MB. The last modification dates for these files are the same, except for the .pyd file that states the version. Using the bsdiff4.file_diff() method between these two versions produces the same result. I can provide both of these files if needed.
To Reproduce Steps to reproduce the behavior:
Expected behavior A patch size that is less than 10 MB. On a previous run, I regenerated just the .exe (running PyInstaller, copying only the .exe, and deleting everything else, while following the steps mentioned above). The .exe file size is 17 MB, while the generated patch was 35 MB for that run.
System info (please complete the following information):