dashmage / lp-charm-apt-mirror

GNU General Public License v3.0
0 stars 0 forks source link

Cleanup action removes packages still in use #1

Closed dashmage closed 8 months ago

dashmage commented 8 months ago

We found that the automatically-run cleanup action on a synchronise can remove packages that are still referenced by existing Packages lists. This was observed by deb files in all repositories except for "main" being removed, after running the synchronize action.

I believe it is due to [1] in the locate_package_indicies function. The line of code correctly iterates over all dists, but it then takes only the first Packages* file it can find, alphabetically. This just happens to be in the main repository, because universe, multiverse, restricted are all alphabetically larger.

[1] https://github.com/canonical/charm-apt-mirror/blob/23cd8fa1296d2ce557a39a00d4615760699827e4/src/utils.py#L86

On a related note, this destroyed the snapshots as well as the main mirror because they symlink to the pool directory. This was very surprising! I guess the symlink is to avoid the snapshots taking up unnecessary space. Can I suggest an alternative to use hardlinks for the deb files in the snapshots? This way an unexpected deletion wouldn't happen, and no unnecessary disk space would be used.


Imported from Launchpad using lp2gh.

dashmage commented 8 months ago

(by vultaire) I've also reviewed this. I agree with Danny's findings. The code was likely written the way it was because there are often Packages, Packages.bz2 and Packages.gz files (maybe even Packages.xz as well?), and these all have the same real content, thus it makes sense to only parse one copy. However, the code in question is not actually looking just at the packages files in the same immediate directory, but across multiple directories within the same mirrored repository subfolder - thus the code unfortunately ends up discarding far more than it should.

dashmage commented 8 months ago

(by vultaire) I thought I had some code which could be used here, but upon further reflection my repository "health check" code actually ends up parsing the compressed Packages files as well as the uncompressed ones - thus it's not really a good reference for correcting the broken logic here.

I think the key thing is: for each directory which contains package files, at least one Package file should be parsed. Rather than searching for all Packages files within a “dists” folder, it’s necessary to recurse to the directories within the dists folder where the Packages files are directly contained.

Or, alternatively (and perhaps a better long term solution): when we create a snapshot, rather than the current method of symlinking the “pool” subdirs and copying everytihing else, it’s probably better to do a “cp -r --link”-style copy of the pool.. That is, for files within the pool directory, create hard links rather than normal copies. This has the benefit of cleanup being implicit via the filesystem - the hard-linked “copies” don’t take up any extra disk space beyond the little that the hard links themselves consume in the filesystem, and snapshot removal simply becomes a “rm -rf” of the snapshot directory.

…The above all stated, the first method is probably the simpler one to do with the current code.