RubenKelevra / pacman.store

Pacman mirror via IPFS for Arch Linux, EndeavourOS, and Manjaro, plus the custom repos ALHP and Chaotic-AUR.
GNU General Public License v3.0

Cluster usage: HTTP mirror vs pacman cache mount #15

Closed guysv closed 4 years ago

guysv commented 4 years ago

As discussed in a couple of other issues, a cool use for this cluster could be pointing pacman at IPFS to download new packages during upgrades.

There's an implementation detail I would like to discuss: how should pacman integrate with IPFS?

By reading your docs and other issues, I understand your preferred way to do this is to mount the cluster repo on pacman's cache dir (/var/cache/pacman/pkg), forcing pacman to get the package from IPFS during the step where it scans its cache dir to see if the package is already present.

The way I envision doing that is modeling the repo tree after the way traditional mirrors are structured, setting Server = http://127.0.0.1:8080/ipns/pkg.pacman.store/$repo/os/$arch in my /etc/pacman.d/mirrorlist, and using pacman as usual.
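Concretely, the mirrorlist could look something like this (the fallback mirror below is only an example, and the gateway port assumes the default settings):

```ini
# /etc/pacman.d/mirrorlist (sketch)
# Local IPFS gateway first, queried like any ordinary HTTP mirror
Server = http://127.0.0.1:8080/ipns/pkg.pacman.store/$repo/os/$arch
# A regular mirror below it as a fallback if the gateway is down or times out
Server = https://geo.mirror.pkgbuild.com/$repo/os/$arch
```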

Some pros and cons:

pros

- It's possible to fall back to a traditional mirror if the IPFS mirror underperforms.
- Security-wise, the local IPFS node is contained as a low-privileged daemon; no IPFS operation requires root (e.g. for the FUSE mount).
- Opportunity to use a remote IPFS node.
- The pacman cache keeps acting the way it should: as a cache that prevents extra network traffic for packages that are already present.
- If the cluster master does not serve an IPFS gateway, the load should be balanced well between the cluster peers.

cons

RubenKelevra commented 4 years ago

Thanks for splitting this off into a dedicated ticket. It makes more sense than discussing it in a closed ticket :)

> By reading your docs and other issues, I understand your preferred way to do this is to mount the cluster repo on pacman's cache dir (/var/cache/pacman/pkg), forcing pacman to get the package from IPFS during the step where it scans its cache dir to see if the package is already present.

This is correct, but pacman needs a writable path for packages. I would just add a non-writable path in front of the usual cache dir, which pacman supports perfectly well: it searches the cache directories one at a time until the package is either found or not.
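Concretely, something like the following pacman.conf sketch, assuming the cluster's package folder is mounted read-only at /mnt/ipfs-pkg (the mount point here is just a placeholder, not a documented path):

```ini
# /etc/pacman.conf (sketch)
[options]
# Cache dirs are searched in order; the read-only IPFS mount comes first
CacheDir = /mnt/ipfs-pkg/
# Regular writable cache; pacman downloads into the first writable CacheDir,
# so anything not found on the IPFS mount still ends up here as usual
CacheDir = /var/cache/pacman/pkg/
```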

> It's possible to fall back to a traditional mirror if the IPFS mirror underperforms. I want to be able to use IPFS to download packages, but not if it performs worse than regular mirrors. By setting traditional mirrors in my mirrorlist after my local IPFS gateway, pacman will switch to the old mirrors if the IPFS gateway times out.

True, IPFS might be slower than a traditional mirror. But that's somewhat unlikely, since you can download from all cluster members plus every client that has already downloaded the package. So it's more likely that IPFS will be able to fully utilize your connection, regardless of how fast it is. Additionally, the current speed will improve considerably, since IPFS made major Bitswap improvements with the help of Netflix last week: https://blog.ipfs.io/2020-02-14-improved-bitswap-for-container-distribution/

The import server was able to handle somewhere around 60 MByte/s via IPFS in my testing, so that should be enough to distribute new packages quickly within the cluster.

> Security-wise, the local IPFS node is contained as a low-privileged daemon; no IPFS operation requires root (e.g. for the FUSE mount). As IPFS is still an experimental technology, I will sleep better at night knowing high privileges don't come near my IPFS node.

Mounting IPFS won't use any superuser rights. IPFS won't run as a superuser but under a system user account; that's how the cluster and the IPFS daemon on the import server are currently configured.

I fully understand that you'd have trouble sleeping if IPFS were running as root.
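For illustration, an unprivileged setup could look roughly like this; the unit below is only a sketch, not the service file actually used, and the user name and paths are assumptions:

```ini
# /etc/systemd/system/ipfs.service (illustrative sketch)
[Unit]
Description=IPFS daemon (unprivileged)
After=network.target

[Service]
# Dedicated system user, no root involved
User=ipfs
Group=ipfs
Environment=IPFS_PATH=/var/lib/ipfs
ExecStart=/usr/bin/ipfs daemon --enable-gc
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The FUSE mount (`ipfs mount`) can then run as that same user, provided the /ipfs and /ipns mount points exist and are writable by it.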

> Opportunity to use a remote IPFS node. The way I plan to use the cluster repo is to replicate the cluster onto a server in my home network, then set it as a mirror for the couple of Arch machines I have in my LAN. This way, IPFS does not even need to be installed on my Arch machines, as I'll set the mirrorlist as: Server = http://localarchclusterpeer/ipns/pkg.pacman.store/$repo/os/$arch

You don't have to replicate the cluster into your home network; you just have to run an IPFS node to get this functionality. Running a cluster node will precache the whole mirror with every little update (even staging and unstable stuff), which might increase your bandwidth use.

But, updates will be very fast. :)
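Roughly, the two levels of participation look like this (a sketch; the follower configuration URL is deliberately left as a placeholder):

```sh
# 1) Plain IPFS node: only fetches and caches what you actually request
ipfs init
ipfs daemon &

# 2) Cluster follower: replicates (precaches) everything the cluster pins,
#    at the cost of extra bandwidth and disk space
# ipfs-cluster-follow pacman.store run --init <follower-config-url>
```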

The best solution, IMHO, is to run an IPFS client on every machine; they will automatically find each other via mDNS. I also plan to write a small script that automatically pins the updates a machine will need next, so the download into the IPFS cache happens as soon as new updates are available, and you won't need to download any package that isn't installed somewhere in your network.
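That script doesn't exist yet; a rough sketch of the idea could look like this, where the /ipns folder layout is only an assumption on my part:

```sh
#!/bin/sh
# Sketch: pre-pin pending updates so they land in the local IPFS cache
# before the actual upgrade runs. Assumes up-to-date sync databases.
IPFS_PKG_DIR=/ipns/pkg.pacman.store/arch/x86_64/default  # assumed layout

# pacman -Sup prints the download URLs of all packages needed for the upgrade
pacman -Sup 2>/dev/null | grep '^http' | while read -r url; do
    pkg=$(basename "$url")
    ipfs pin add "$IPFS_PKG_DIR/$pkg" || echo "could not pin $pkg" >&2
done
```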

That said, I appreciate that you're running a cluster member :)

> Note that now the pacman cache actually acts the way it should: as a cache that prevents pacman from causing extra network traffic for packages that are already present.

IPFS will also cache the packages you have installed. There's no network activity involved to "download" the packages from your local IPFS cache, since they are already present.

If you're running low on disk space (for the IPFS cache), IPFS will clean up the cache automatically to get some space back. So you can set the cache size to any value you like, and the cleanup process will keep it from growing beyond that.

So there's no 100 GB space requirement to get updates from IPFS, if you think that's the case.

I would recommend around 5-10 GB of cache if you install large packages over IPFS; otherwise, 2 GB should be sufficient.
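For reference, a cache limit like that can be set roughly as follows with go-ipfs (automatic garbage collection only runs when the daemon is started with --enable-gc):

```sh
# Soft limit for the IPFS repo size; GC aims to keep the repo below it
ipfs config Datastore.StorageMax 10GB
# Trigger GC once the repo reaches 90% of StorageMax
ipfs config --json Datastore.StorageGCWatermark 90
# Start the daemon with automatic garbage collection enabled
ipfs daemon --enable-gc
```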

> If you, the cluster master, do not serve an IPFS gateway, the load should be balanced well between us cluster peers.

A role I don't want to have in the first place. In the long term, I'd like to move the cluster to trusted Arch users or Arch developers, to provide a second update channel for users next to the regular mirrors.

Having me as the importer for the cluster isn't an ideal condition. I want to keep the service working as well as I can, so maybe I'll add a second or third server in different locations, but in the long term I'd like to remove the liability of being the one guy importing the Arch updates into IPFS, if you know what I mean. :)

RubenKelevra commented 4 years ago

@guysv are you fine with me going ahead, implementing my idea to the end, and pushing this issue to the backlog until I'm finished?

I don't think I have the time to implement multiple approaches at the same time.

We can discuss afterwards what needs improvement and whether it makes sense to change things or add alternative concepts.

guysv commented 4 years ago

Yes. Let's give the mount approach a try :)