chaotic-aur / toolbox

Unified kit with all the scripts required for maintaining the repository 🧰
https://aur.chaotic.cx
GNU Affero General Public License v3.0

Delta update support (pacman) #40

Closed RubenKelevra closed 2 years ago

RubenKelevra commented 2 years ago

Hey guys,

I did some testing on how efficient delta updates (if implemented) would be in pacman. Since you have some of the biggest packages around, I thought you might be interested in taking a look at the findings and taking part in the discussion:

https://lists.archlinux.org/pipermail/pacman-dev/2022-May/025568.html

The numbers are super promising, with an average saving of 40% on "source code heavy" packages and sometimes above 99% for data heavy packages with applying times below 1 second on modern computers.

Best regards,

Ruben

Technetium1 commented 2 years ago

Interesting concept, but I'm not sure it'll be implemented until it's more widely used. The more I read about it, the more it seems that it was killed off because no official repos used it. Thanks for posting, looking forward to reading others' views. https://www.reddit.com/r/archlinux/comments/b7zkg5/why_delta_support_removed_from_pacman/

PedroHLC commented 2 years ago

We have packages like mesa-*-git that update every hour, others that don't do so much, but their diff is quite big. If we kept about a month's worth of delta packages, we would need at least four times the amount of space we're using now. For our main node that wouldn't be a problem, but most of our mirrors are very low-tier VPSes running on small SSDs.

RubenKelevra commented 2 years ago

@Technetium1 it was killed because there were security issues with the implementation. A forked database file (which isn't signed) could install an arbitrary package via a delta patch and run commands as root.

My idea includes an additional signature for the unpacked archive, so there are effectively two signature files. And after the patch would be applied, the second signature would confirm that the patch is valid.

This allows patches to be created on demand, or automatically on a repository server, without bothering the package maintainers with their creation.
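The two-signature idea above could look roughly like the following sketch. It is only an illustration under assumptions: the function name, the argument order and the use of a detached GnuPG signature over the uncompressed tar are hypothetical and not part of pacman or of the proposal on the mailing list. A real implementation would also verify against pacman's own keyring rather than the caller's default one.

```shell
# Hypothetical helper: rebuild the new tar from a delta, then accept it only
# if a detached signature over the *unpacked* archive verifies.
# Usage: apply_and_verify OLD_TAR DELTA NEW_TAR SIG
apply_and_verify() {
    old_tar=$1
    delta=$2
    new_tar=$3
    sig=$4

    # Step 1: apply the zstd patch against the old, uncompressed archive.
    if ! zstd -d -q -f --patch-from="$old_tar" "$delta" -o "$new_tar"; then
        rm -f "$new_tar"
        return 1
    fi

    # Step 2: the second signature confirms the patched result.
    # Assumption: $sig is a detached signature made over the new tar file.
    if ! gpg --verify "$sig" "$new_tar" >/dev/null 2>&1; then
        rm -f "$new_tar"   # never leave an unverified archive behind
        return 1
    fi
}
```

Because the check runs on the reconstructed archive, it does not matter who produced the delta: a tampered or mismatched patch simply fails verification and the result is discarded.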

Zstd patches are also EXTREMELY fast compared to the previous approach, so even on older computers it's viable to use them if you don't have at least 100 MBit/s download speed.

You can test out the efficiency yourself:

zstd --patch-from=package-old_version.pkg.tar package-new_version.pkg.tar -o old_version_to_new_version.pkg.tar.zst_delta

to create it.

Then compare it to the size of package-new_version.pkg.tar.zst

To apply the patch, decompress package-old_version.pkg.tar.zst on a different machine and fetch old_version_to_new_version.pkg.tar.zst_delta.

Then run

zstd -d --patch-from=package-old_version.pkg.tar old_version_to_new_version.pkg.tar.zst_delta -o package-new_version.pkg.tar

RubenKelevra commented 2 years ago

@PedroHLC wrote:

> We have packages like mesa-*-git that update every hour, others that don't do so much, but their diff is quite big.

I've tested this:

The delta from mesa-tkg-git-22.2.0_devel.153443.a7f44b62694-1-x86_64.pkg.tar to mesa-tkg-git-22.2.0_devel.153445.d2ab0ed31e1-1-x86_64.pkg.tar would be 3.8M, and the delta from mesa-tkg-git-22.2.0_devel.153445.d2ab0ed31e1-1-x86_64.pkg.tar to mesa-tkg-git-22.2.0_devel.153609.2b28668d1da-1-x86_64.pkg.tar would be 22M.

So that's a saving of 92.8% and 58.5%, respectively.
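
For reference, the saving is one minus the ratio of delta size to full download size. The sketch below shows that arithmetic. The thread does not state the size of the full compressed package, so the 53M baseline is an inferred assumption (it is the value that reproduces both quoted percentages), and saving_pct is a hypothetical helper name.

```shell
# saving_pct DELTA_SIZE FULL_SIZE: percentage saved by fetching the delta
# instead of the full package. Both sizes must use the same unit.
saving_pct() {
    LC_ALL=C awk -v delta="$1" -v full="$2" \
        'BEGIN { printf "%.1f\n", (1 - delta / full) * 100 }'
}

# Assumed baseline of 53M for the full compressed package (not from the thread):
saving_pct 3.8 53   # prints 92.8
saving_pct 22 53    # prints 58.5
```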

And yes, that's maybe not something every mirror wants to store. But on the other hand, that's fully optional – you can just create two tiers.

I'm, for example, happy to store 2-3 days' worth of deltas on my mirrors if this allows users to install updates more often or faster.
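
A mirror that only wants to keep a few days of deltas could prune older ones with something like the sketch below. The function name, the directory argument and the .zst_delta suffix (taken from the example file names earlier in the thread) are assumptions, and -delete is a GNU/BSD find extension rather than POSIX.

```shell
# prune_deltas DIR [DAYS]: delete delta files in DIR older than DAYS (default 3).
# Note: find rounds file age down to whole days, so "-mtime +3" matches
# files that are at least 4 full days old.
prune_deltas() {
    dir=$1
    days=${2:-3}
    find "$dir" -type f -name '*.zst_delta' -mtime +"$days" -delete
}

# Example: keep roughly three days of deltas in a hypothetical mirror directory.
#   prune_deltas /srv/mirror/deltas 3
```

Since full packages are left untouched, clients that miss a pruned delta can still fall back to the regular download.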

PedroHLC commented 2 years ago

> And yes, that's maybe not something every mirror wants to store. But on the other hand, that's fully optional – you can just create two tiers.

Good idea!

> Zstd patches are also EXTREMELY fast compared to the previous approach

I didn't know zstd had this feature aboard.

JustTNE commented 2 years ago

I had a talk with @jonathonf on Telegram in the past about zsync2-based delta downloads for packages. Zsync2 has the advantage of being able to do delta updates between any two files without a delta file, but the disadvantage is that it has to scan the original file before it can start a delta download. This would definitely be the lower-effort approach for us, but I doubt the client speed numbers would be the same. It also uses multiple HTTP range requests instead of applying a single delta file, so download speed suffers as well.

PedroHLC commented 2 years ago

I doubt this is on anyone's priority list to implement, so let me close this. In case someone is interested, feel free to open a PR.

nyabinary commented 2 years ago

No idea how to do this, but I would be interested in this being implemented.