freebsd / pkg

Package management tool for FreeBSD. Help at #pkg on Libera Chat or pkg@FreeBSD.org

Binary Diff for package upgrades #1869

Open darkfiberiru opened 4 years ago

darkfiberiru commented 4 years ago

Currently, when tracking packages on head, every package needs to be uploaded to the pkg server and redownloaded if even a single byte changes. For example, an ABI change causes the CDN for packages to be completely refreshed with new but nearly identical files. I propose a system similar to differential backups, where a binary diff of a package (its untarred contents) could be applied, so that at most you have two "types": a full build of a package at a given version, and a differential. Either some heuristic or manual intervention is used to set up a new "full" version of the package as necessary.

Steps

  1. Download the package via the data in digests.
  2. Check whether the package is a differential or a full package.
     a. If it is a full package, continue as normal.
     b. If it is a differential, download the full package (it may already be cached on disk).
  3. Apply the binary diff to the individual files.
  4. Install the newly rebuilt pkg.
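The steps above can be sketched as a small planning function. This is only an illustration of the proposed flow, not how pkg works today: `RepoEntry`, `plan_fetch`, the `kind`/`base` fields, and the action names are all hypothetical, and the catalog stands in for whatever the digests data would carry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepoEntry:
    """Hypothetical catalog record; not an existing pkg structure."""
    name: str
    kind: str                    # "full" or "diff"
    base: Optional[str] = None   # for a diff: the full package it applies to


def plan_fetch(target, catalog, cache):
    """Return the ordered (action, package) steps needed to install `target`.

    `catalog` maps package names to RepoEntry; `cache` is the set of full
    packages already present on disk.
    """
    entry = catalog[target]
    steps = [("download", entry.name)]            # step 1
    if entry.kind == "full":                      # step 2a
        steps.append(("install", entry.name))
        return steps
    if entry.kind != "diff":
        raise ValueError(f"unknown package kind: {entry.kind!r}")
    if entry.base is None or catalog[entry.base].kind != "full":
        raise ValueError("a differential must reference a full package")
    if entry.base not in cache:                   # step 2b
        steps.append(("download", entry.base))
    steps.append(("apply_diff", entry.name))      # step 3
    steps.append(("install", entry.name))         # step 4
    return steps
```

With a cached full package, only the differential is downloaded; without one, both are fetched before the diff is applied.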

This is especially important for something like cloud storage, where uploading new data may have a cost, and it should also make updates smaller at least part of the time when an older version of the package is already installed. I'm willing to start work on this, but I know this is a very bikesheddable topic and wanted to get feedback early. Some people have probably tried and failed, and I'm very interested in any existing work.

I still need to research the exact efficiency gained by this method, the maximum memory needed for diffing, the compute cycles needed to do a diff, etc.

But to do this safely and securely in a fully signed fashion, I believe all the fetching and applying of diffs needs to be done inside pkg itself.

I apologize if this is a duplicate of an existing issue, but surprisingly I could not find one.

Also, this is completely my own goal and work on personal time, not connected to my employer.

darkfiberiru commented 4 years ago

Currently my plan is to leave poudriere alone and post-process the package set to produce the differentials, so only pkg would need to be modified for this. Maybe apply some heuristic, like replacing the original full package if the diff is 40% of the size of the full package, or if the "new" package is 40% smaller, before even doing the diff.
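A minimal sketch of that heuristic, assuming sizes are measured in bytes and that "40%" is a tunable threshold. The function name and parameters are hypothetical, and whether "40% smaller" should be measured against the old full package is my reading of the idea rather than a settled rule.

```python
from typing import Optional


def should_publish_full(old_full_size: int,
                        new_full_size: int,
                        diff_size: Optional[int] = None,
                        threshold_pct: int = 40) -> bool:
    """Decide whether to publish a new full package instead of a differential."""
    if old_full_size <= 0 or new_full_size <= 0:
        raise ValueError("package sizes must be positive")
    # Check before diffing: the new package is at least threshold_pct
    # smaller than the old full package, so skip the diff entirely.
    if new_full_size * 100 <= old_full_size * (100 - threshold_pct):
        return True
    # Check after diffing: the diff is threshold_pct or more of the
    # full package size, so it saves too little to be worth shipping.
    if diff_size is not None and diff_size * 100 >= new_full_size * threshold_pct:
        return True
    return False
```

Integer arithmetic is used for the comparisons so that values exactly on the threshold are not affected by floating-point rounding.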

dch commented 4 years ago

As a first step, what about trying to find a suitable binary diff approach for some of the really large compiler packages like llvm or rust, where the compilation time is very long?

A multi-GB rsync of compressed content is a tricky thing.

darkfiberiru commented 4 years ago

@dch Not quite sure what you're asking for here, as this isn't tied to compilation times; this is about creating binary diffs for packages on a per-file basis within the packages. The largest file inside the llvm 90 package is about 70 MB, so bsdiff needing 18x the space to perform a binary diff means you only need roughly 1.3 GB of RAM.
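The memory estimate can be checked with a few lines of arithmetic. This sketch takes the 18x factor quoted in this comment as a given rather than a measured value; the function name is hypothetical.

```python
def bsdiff_memory_estimate(largest_file_bytes: int, factor: int = 18) -> int:
    """Estimate peak memory for diffing one file, assuming `factor` x its size."""
    if largest_file_bytes < 0:
        raise ValueError("file size must be non-negative")
    return largest_file_bytes * factor


# A file of exactly 70 MB (decimal) at 18x:
estimate = bsdiff_memory_estimate(70 * 10**6)
print(f"{estimate / 10**9:.2f} GB")  # prints "1.26 GB"
```

Since the largest file is only "about" 70 MB, the real figure shifts with the exact size, but it stays in the low single-digit GB range.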

A lot of work can also be done to determine when a binary diff is even needed. For example, if the file sizes match and the md5 matches, then no diff is needed, but if it turns out all that changed was an ABI in an ELF binary, then hopefully we get a "small" binary diff.
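A rough sketch of that pre-check, working on file contents held in memory. The function name is hypothetical, and it defaults to SHA-256 rather than the md5 mentioned above; that is a substitution on my part, and any digest `hashlib` supports can be passed in.

```python
import hashlib


def needs_diff(old: bytes, new: bytes, algorithm: str = "sha256") -> bool:
    """Return True if a per-file diff is worth computing at all."""
    # Cheap check first: different sizes always mean different content.
    if len(old) != len(new):
        return True
    # Same size: compare digests to see whether the content changed.
    old_digest = hashlib.new(algorithm, old).digest()
    new_digest = hashlib.new(algorithm, new).digest()
    return old_digest != new_digest
```

In a real package the digests would more likely come from recorded per-file checksums than be recomputed, which would make this check nearly free.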

Some kind of diff manifest would be needed inside the differential pkg to explain how to apply it to the "full package": when to outright delete files, copy over brand-new files, apply binary patches, or apply text patches (maybe helpful, not sure). I think an iterative approach to development is helpful to figure out what optimizations are needed after scoping and sketching out a frame of a plan.
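To make the manifest idea concrete, here is a minimal sketch that derives the operations from two file listings. The manifest layout (`op`/`path` records) and the function name are made up for illustration, and text patches are left out.

```python
from typing import Dict, List


def build_manifest(old_files: Dict[str, str],
                   new_files: Dict[str, str]) -> List[dict]:
    """Build a list of operations turning the old file set into the new one.

    Both arguments map a file path to a content digest (any stable string).
    """
    ops = []
    # Files that only exist in the old package are deleted outright.
    for path in sorted(old_files.keys() - new_files.keys()):
        ops.append({"op": "delete", "path": path})
    # Files that only exist in the new package are shipped whole.
    for path in sorted(new_files.keys() - old_files.keys()):
        ops.append({"op": "add", "path": path})
    # Files present in both get a binary patch only if the digest changed.
    for path in sorted(old_files.keys() & new_files.keys()):
        if old_files[path] != new_files[path]:
            ops.append({"op": "patch", "path": path})
    return ops
```

Unchanged files produce no entry at all, so a differential for an ABI-only rebuild would list just the patched binaries.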

darkfiberiru commented 4 years ago

@bapt @bdrewery Any thoughts?

lin7sh commented 3 years ago

Should we consider the --patch-from support built into the recent zstd release? There's a 4G limit for now, though.