aptly-dev / aptly

aptly - Debian repository management tool
https://www.aptly.info/
MIT License
2.55k stars, 369 forks

consider adding git backend #1169

Closed by james-lawrence 10 months ago

james-lawrence commented 1 year ago

git is fairly universal: there are many hosting providers, secure access is easy to set up, and Go libraries exist for interacting with git repositories.

just a thought; I don't personally need this, but it would make aptly easier to use, given that a git hosting service exists in every company.

r4co0n commented 11 months ago

You are right, git is ubiquitous. But git is also a really bad place to store package mirrors, especially the pool, which consists solely of compressed files, as you can't delete things from a git branch without rewriting history.

I don't see how git would be a viable storage target. Btw, you can already have your data in git: just install something like etckeeper, configure it to use git, and have it keep track of your aptly base directory. (Only do this for giggles; this is a bad idea, as outlined previously.)

james-lawrence commented 11 months ago

git is perfectly fine for this; I've been using git repositories to host packages for ages. the only reason git would be bad is if you were updating individual binary files in place, which aptly doesn't do; it generates new ones.

the benefits are fairly straightforward: ubiquitous providers and support, or self-hosting. if you have code that you're shipping as packages, then you almost certainly already have hosted git infrastructure.

r4co0n commented 11 months ago

We may be talking about different things. I was thinking about mirroring a whole distribution, e.g. Debian Bookworm's main/contrib branches and the corresponding debian-security mirrors. Over the lifetime of a Debian release, there is quite some churn in the repository as new package versions get added.

I am fully aware of how aptly handles its pool. Saying "git is perfectly fine for this" is equivalent to saying "You never need to drop/remove an aptly snapshot and run aptly db cleanup". These mechanisms are there for a reason, and git, as a VCS, is designed to not "forget history".

Also, it is completely unclear how locking would work in the scenario that you describe. When two aptly instances are working with the same base directory concurrently, are we supposed to check in and commit the lock file?

DEB packages are blobs. If someone wanted to store a large collection of blobs in a git repository, I would recommend git-lfs for that. And when you think about it, that's where the argument comes full circle, as aptly is already able to use comparable storage systems. Git might be interesting for looking at old Release files and the like, but making it handle the pool, the db and state still seems wrong to me. I am happy to be convinced otherwise.
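
For reference, a small sketch of how `*.deb` files would be routed to git-lfs. The attribute line is the one that `git lfs track "*.deb"` records in `.gitattributes`; the scratch repository and the two example paths are made up for illustration, and git-lfs itself has to be installed before any content is actually stored through it.

```shell
#!/bin/sh
set -eu

# Scratch repository; nothing here touches a real aptly base directory.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# Route every .deb through the LFS filter (what `git lfs track "*.deb"` records).
printf '%s\n' '*.deb filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask git which filter would apply to a hypothetical pool file
# and to a hypothetical Release file. The paths do not need to exist.
deb_filter=$(git check-attr filter -- pool/example_1.0_amd64.deb)
release_filter=$(git check-attr filter -- dists/stable/Release)
echo "$deb_filter"
echo "$release_filter"
```

Only the pool blobs match the pattern; small text files such as Release would stay as regular git objects.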

james-lawrence commented 11 months ago

the history is fairly immaterial to the problem; you're serving a set of files, and the history doesn't play into it. the only time you care about the total size including history is when you're deep cloning the repository. but as I've said, if you're really concerned about the size, it's fairly trivial to clean up the history.

git has plenty of tooling for shallow clones of just a point in time.
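
A minimal sketch of the shallow-clone point, assuming a throwaway local repository stands in for a hosted one (the directory names and the `pkg-N.deb` files are invented for the demo). It only shows what the client fetches; the origin still holds the full history.

```shell
#!/bin/sh
set -eu

# Scratch area; everything below is hypothetical demo data.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Build a small "package" repository with three commits on main.
git init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/main
for n in 1 2 3; do
    echo "payload $n" > "$work/origin/pkg-$n.deb"
    git -C "$work/origin" add .
    git -C "$work/origin" commit -q -m "release $n"
done

# --depth is ignored for plain local paths, hence the file:// URL.
git clone -q --depth 1 "file://$work/origin" "$work/shallow"

full_count=$(git -C "$work/origin" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
shallow_files=$(ls "$work/shallow" | wc -l | tr -d ' ')
echo "origin commits: $full_count, shallow commits: $shallow_count, files: $shallow_files"
```

The shallow clone ends up with every file from the latest commit but only that one commit of history.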

to quickly and efficiently clean up a git repository hosting packages, these commands limit the repository history to 5 commits but keep all files that are currently available in those 5 commits, even if they were created in commits that were removed:

git rev-parse HEAD^^^^ > git.sha
git checkout --orphan tmpmain $(cat git.sha)
git commit -m 'release packages'
git rebase --onto tmpmain $(cat git.sha) main
git branch -D tmpmain
git push --force origin HEAD:main
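
To see what that sequence does, here is a sketch that runs it against a scratch repository with eight commits. The repository, the `pkg-N.deb` files and the commit messages are made up; the base commit is held in a shell variable rather than a `git.sha` file, and the final forced push is left out because the demo has no remote.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Eight commits on main, each adding one hypothetical package file.
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main
for n in 1 2 3 4 5 6 7 8; do
    echo "payload $n" > "pkg-$n.deb"
    git add .
    git commit -q -m "release $n"
done
before=$(git rev-list --count main)

# Squash everything up to HEAD^^^^ into one new root commit,
# then replay the four newer commits on top of it.
base=$(git rev-parse HEAD^^^^)
git checkout -q --orphan tmpmain "$base"
git commit -q -m 'release packages'
git rebase -q --onto tmpmain "$base" main
git branch -q -D tmpmain

after=$(git rev-list --count main)
files=$(git ls-tree --name-only main | wc -l | tr -d ' ')
echo "commits before: $before, after: $after, files at tip: $files"
```

Locally, the dropped commits and their blobs stay on disk until the reflog entries expire and `git gc` prunes them; how a hosting provider handles that after the forced push is outside this sketch.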

the point is git is:

  1. universal
  2. easy to manage
  3. available from many hosting providers
  4. light on infrastructure

I'm not arguing to replace s3 or other backends. I'm just saying it's a perfectly reasonable one to add and use, especially when just getting started. once you outgrow git for performance reasons, you can move to s3 or some other backend. but I'd love to dump launchpad and not need to set up s3/gcs when git can do the job. it'd make aptly fantastic for hosting personal packages/projects that don't have a lot of infrastructure resources.

r4co0n commented 10 months ago

I still think aptly's users expect aptly snapshot drop <snapshot_name> to remove a snapshot and aptly db cleanup to tidy up all remnants of these snapshots. If git were treated as a storage backend, this would no longer be a trivial task: you can't just unlink unreferenced files, but need to rewrite git history and force-push it (git push --force origin main). So I suppose aptly db cleanup would have to do your "quick and efficient cleanup". This is potentially destructive, depending on what other stuff the user tracked in this repository. Our current backends do not have that problem, as we just remove the files we know are ours and no longer referenced.
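
As a rough illustration of the "just unlink unreferenced files" model on a plain filesystem (this is not aptly's actual cleanup code; the pool layout, file names and the `referenced.txt` list are all hypothetical):

```shell
#!/bin/sh
set -eu

# Hypothetical pool with three package files.
work=$(mktemp -d)
mkdir -p "$work/pool"
for f in a_1.0.deb a_1.1.deb b_2.0.deb; do
    echo "payload" > "$work/pool/$f"
done

# Hypothetical list of files still referenced by some snapshot or publication.
printf '%s\n' a_1.1.deb b_2.0.deb | sort > "$work/referenced.txt"

# Anything present in the pool but absent from the list is an orphan.
ls "$work/pool" | sort > "$work/present.txt"
comm -23 "$work/present.txt" "$work/referenced.txt" > "$work/orphans.txt"

# Unlink the orphans; no history to rewrite, nothing else is touched.
while IFS= read -r name; do
    rm -- "$work/pool/$name"
done < "$work/orphans.txt"

echo "removed: $(tr '\n' ' ' < "$work/orphans.txt")"
echo "remaining: $(ls "$work/pool" | tr '\n' ' ')"
```

With a git-backed pool, the same `rm` would only remove the file from the next commit; the blob would remain reachable through earlier commits until the history is rewritten.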

If you want to use aptly to host a small personal project, I think you can also just commit the resulting publication, and not care about aptly metadata, throw away your db and your instance, and start with a fresh aptly instance for the new release. As long as you keep the same config and use the same GPG key, your new publication version committed to your git repo will look no different than if you kept the db.

james-lawrence commented 10 months ago

I covered rewriting the history with a trailing window in my previous comment; it's something aptly can handle trivially. but your objections (even if I think they're unfounded) are noted, and I doubt there will be any willingness to add a git backend, so I'm closing this.