etix / mirrorbits

Mirrorbits is a geographical download redirector written in Go for distributing files efficiently across a set of mirrors.
MIT License

mirrorbits for linux distributions (rapidly changing metadata files) #85

Open stormi opened 5 years ago

stormi commented 5 years ago

As discussed on IRC, in the context of a linux distribution, repository metadata files can change quite often, and when they change it can cause a delay during which no mirror can serve those files (unless you provide at least one mirror that syncs instantly).

It can also happen that a user with a slightly older repository metadata cache tries to install a file from the mirrors and gets an error because the file does not exist anymore in mirrorbits' local reference, and there's no grace delay to let the cache expire (usually a few hours). It might be preferable to let the request reach one of the mirrors that have not synced yet and that still have the file.

A few leads (may contain very bad ideas!):

PalinuroSec commented 4 years ago

ping any updates?

ott commented 1 year ago

Perhaps it would be best if the origin server would also keep old files for as long as it expects that clients have not updated their metadata. This would remove this burden from the mirrors and the download redirector.

elboulangero commented 8 months ago

@stormi I'd be curious to know if/how you solved it for XCP-ng. I've been looking at the same issue for Kali Linux. Quoting what you said at the time:

an option to "remember" deleted files for a while and keep serving them if mirrors still have them + accepting to serve an old version of some files (repomd.xml and repomd.xml.asc for yum repositories for example) if no mirror (or no mirror close enough) has the new version?

For Kali, we don't need to « "remember" deleted files for a while », in the sense that we solve this issue with reprepro, the tool that generates the repository. When a package is replaced by a new one, we keep the old package around in the archive for a few days. Hence this problem doesn't need to be solved at the Mirrorbits level.

However, I definitely observed the issue that you mentioned with metadata (the files that are updated in place). When we push an update of the repo, Mirrorbits will be quickly aware of the new version of the metadata files, and since it didn't rescan the mirrors yet (and maybe the mirrors didn't even sync anyway), it can't redirect, and it goes in fallback mode for those files.

« accepting to serve an old version of some files » seems to be a good solution for Kali, so I implemented this feature in https://github.com/etix/mirrorbits/pull/147.

stormi commented 8 months ago

@elboulangero no, we haven't solved it. Thankfully, the metadata on non-testing repositories doesn't change often, so issues are rare.

elboulangero commented 8 months ago

@stormi I'd like to improve the MR https://github.com/etix/mirrorbits/pull/147 so that it would work for RPM repos as well.

As I said quickly above, the idea with this MR is to tell mirrorbits to accept serving old versions of some files, within a certain time limit (when files are really too old, mirrorbits will stop serving them).

So far, the setting I proposed is pretty crude, as the only matching option is a prefix. It works for Kali, as all I want to do is to match request paths that start with /dists/, and allow files under this prefix to be outdated.
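To make the idea concrete, here is a minimal sketch of a prefix match combined with a time limit. It is not the code from the MR: the type `OutdatedRule`, the function `AllowOutdated`, and the 6-hour limit are all made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// OutdatedRule is a hypothetical setting: requests whose path starts with
// Prefix may be redirected to a mirror holding an older copy of the file,
// provided that copy lags the reference copy by no more than MaxAge.
type OutdatedRule struct {
	Prefix string
	MaxAge time.Duration
}

// AllowOutdated reports whether a mirror copy that is `lag` older than the
// reference copy may still be served for the given request path.
func AllowOutdated(rules []OutdatedRule, path string, lag time.Duration) bool {
	for _, r := range rules {
		if strings.HasPrefix(path, r.Prefix) && lag >= 0 && lag <= r.MaxAge {
			return true
		}
	}
	return false
}

func main() {
	// Assumed configuration: metadata under /dists/ may be up to 6 hours old.
	rules := []OutdatedRule{{Prefix: "/dists/", MaxAge: 6 * time.Hour}}

	// Within the limit: allowed.
	fmt.Println(AllowOutdated(rules, "/dists/kali-rolling/InRelease", 2*time.Hour))
	// Too old: refused.
	fmt.Println(AllowOutdated(rules, "/dists/kali-rolling/InRelease", 12*time.Hour))
	// Prefix does not match: refused.
	fmt.Println(AllowOutdated(rules, "/pool/main/f/foo/foo_1.0_amd64.deb", time.Hour))
}
```

A prefix is cheap to evaluate and hard to get wrong, which is probably why it is enough for a Debian-style layout where all in-place metadata lives under one directory.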

Now, how would that go for the XCP-ng repo, what outdated files do you need to match? I had a quick look, it seems like we could match /repodata/ anywhere in the request path. Or, be stricter, and match the files repomd.xml and repomd.xml.asc. Or maybe repomd.xml.*$, trying to future-proof a bit. What do you prefer? Are those the only metadata files to match, or are there others?
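For comparison, the three matching options above could be written as regular expressions. This is only a sketch: the pattern names and the example paths are invented, and nothing here is an existing mirrorbits setting.

```go
package main

import (
	"fmt"
	"regexp"
)

// Candidate patterns for "files allowed to be outdated" in an RPM repository.
// They only restate the three options from the discussion.
var patterns = map[string]*regexp.Regexp{
	"any-repodata": regexp.MustCompile(`/repodata/`),
	"strict":       regexp.MustCompile(`/repodata/repomd\.xml(\.asc)?$`),
	"loose":        regexp.MustCompile(`/repodata/repomd\.xml.*$`),
}

// Matches reports whether the named pattern matches the request path.
// An unknown pattern name never matches.
func Matches(name, path string) bool {
	re, ok := patterns[name]
	return ok && re.MatchString(path)
}

func main() {
	// Made-up request paths, for illustration only.
	paths := []string{
		"/8/base/x86_64/repodata/repomd.xml",
		"/8/base/x86_64/repodata/repomd.xml.asc",
		"/8/base/x86_64/repodata/0405c825-primary.sqlite.bz2",
	}
	for _, p := range paths {
		fmt.Printf("%s\n  any-repodata=%v strict=%v loose=%v\n", p,
			Matches("any-repodata", p), Matches("strict", p), Matches("loose", p))
	}
}
```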

As you rightfully pointed out, from the moment we allow mirrorbits to serve old metadata, the next issue is that clients that get this old metadata will also request old files, that might not be on the repo anymore. (NB: Mirrorbits won't serve files that are not on the local repo, it doesn't matter if those files are still on the mirrors).

You suggested that Mirrorbits could try to keep track of deleted files for a while. Now that I'm familiar with the code of mirrorbits, I'd prefer to avoid this route. Ok, I'm a bit biased as for my own use-case (Kali), we already solve this issue outside of Mirrorbits. But still, I wonder if you could look at the options you have with the tool you use to create and manage your RPM repository. Is there any option to snapshot a repository?

I suggest the idea of "snapshot" because that's how we do it for the Kali repo, with reprepro. Every time we update the repo (4 times a day, as Kali is a rolling distro), we take a snapshot of the distro. We keep something like the last 10 snapshots. It means that after packages are removed from Kali rolling, they still linger around for 2.5 days, as the snapshots still hold a reference to them.
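The 2.5 days figure follows directly from the two numbers above, assuming the updates are evenly spaced. A small sketch of the arithmetic (the function name `RetentionWindow` is made up):

```go
package main

import (
	"fmt"
	"time"
)

// RetentionWindow estimates how long a removed package stays referenced when
// the last `snapshots` snapshots are kept and the repository is updated
// `updatesPerDay` times a day at regular intervals.
func RetentionWindow(snapshots, updatesPerDay int) time.Duration {
	interval := 24 * time.Hour / time.Duration(updatesPerDay)
	return time.Duration(snapshots) * interval
}

func main() {
	w := RetentionWindow(10, 4) // 10 snapshots x 6 hours = 60 hours
	fmt.Printf("%.1f days\n", w.Hours()/24)
}
```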

Can you take the same approach for your RPM distro?

@lazka Please allow me to pull you into the discussion, as you're maintaining an Arch-based distro it seems, and I'd like to also have your feedback. Do you have the same kind of issue, to start with?

lazka commented 8 months ago

@lazka Please allow me to pull you into the discussion, as you're maintaining an Arch-based distro it seems, and I'd like to also have your feedback. Do you have the same kind of issue, to start with?

We only upload about once a day, and the only metadata change there is that the database files change, which amounts to ~12MB. While that means all clients will pull from the main server, it hasn't been a problem so far (at least no one complained). We don't have that many users, and most traffic comes from downloading packages, which this doesn't affect. We also have enough mirrors that things get in sync quite fast. So it's definitely not a problem traffic-wise, but it might result in sluggish database syncs for some far away users for a bit. There is also an upcoming change in pacman where package signatures will be moved out of the database files, which will reduce the metadata size by ~50%.

As for trying to fetch no longer existing files: We keep all packages for >1.5 years before we prune them, so this isn't really an issue for us.

tl;dr: we don't have that many packages or users for this to be a big problem.

The only potential problem I see with serving existing files from outdated mirrors is that two DB syncs in a short period might lead to pacman doing package downgrades, if it happens to hit an old mirror after a fresh one, which we don't support really.

elboulangero commented 8 months ago

Ack, thanks very much for your detailed reply @lazka!

The only potential problem I see with serving existing files from outdated mirrors is that two DB syncs in a short period might lead to pacman doing package downgrades, if it happens to hit an old mirror after a fresh one, which we don't support really.

Ah, OK. This is not a problem on Debian's side, as apt will silently discard a Release file that is older than the local one. So if we hit an old mirror after a fresh one, from apt's point of view it just means that the system is up-to-date.
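The rule described here boils down to a date comparison. The sketch below is not apt's code; it only illustrates the behaviour as stated above, assuming the Date field of a Release file has the form "Fri, 01 Mar 2024 18:00:00 UTC". The name `KeepLocal` is made up.

```go
package main

import (
	"fmt"
	"time"
)

// releaseDateLayout is the assumed layout of the Date field in a Release
// file, e.g. "Fri, 01 Mar 2024 18:00:00 UTC".
const releaseDateLayout = time.RFC1123

// KeepLocal reports whether the locally cached Release file should be kept
// because the one just downloaded carries an older Date.
func KeepLocal(localDate, remoteDate string) (bool, error) {
	local, err := time.Parse(releaseDateLayout, localDate)
	if err != nil {
		return false, fmt.Errorf("local date: %w", err)
	}
	remote, err := time.Parse(releaseDateLayout, remoteDate)
	if err != nil {
		return false, fmt.Errorf("remote date: %w", err)
	}
	return remote.Before(local), nil
}

func main() {
	fresh := "Fri, 01 Mar 2024 18:00:00 UTC"
	stale := "Fri, 01 Mar 2024 06:00:00 UTC"

	// Hitting an old mirror after a fresh one: the local file wins.
	keep, err := KeepLocal(fresh, stale)
	fmt.Println(keep, err)

	// Hitting a fresh mirror after an old one: the download replaces it.
	keep, err = KeepLocal(stale, fresh)
	fmt.Println(keep, err)
}
```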

elboulangero commented 8 months ago

Something else I wanted to share in this discussion: the methodology (and scripts) I used to monitor the availability of some files.

In short:

And here's the result, requesting the InRelease file (ie. the first metadata file that is requested by apt update), every minute during a day.

[Image: InRelease availability graph]

What we clearly see above is that, after the sync of 18:00 and the sync of 06:00, for a while the InRelease file was served in fallback mode (the pinkish vertical bars). We see that suddenly, all mirrors are excluded (due to mod time mismatch the first time, and file size mismatch the second time). Then slowly, this number decreases, as the mirrors are synced, and mirrorbits scans them.

I don't know why the number of returned mirrors goes way above 4, and then drops suddenly to 4 at some point. I'm sure this can be explained by a careful reading of the selection algorithm...

Anyway. So if someone wants to do the same check and produce a similar graph, I pushed the scripts at: https://gitlab.com/kalilinux/tools/mirrorbits-scripts/-/tree/main/check-availability. It's very straightforward to use it, there's even a README!

stormi commented 3 months ago

@stormi I'd like to improve the MR #147 so that it would work for RPM repos as well.

As I said quickly above, the idea with this MR is to tell mirrorbits to accept serving old versions of some files, within a certain time limit (when files are really too old, mirrorbits will stop serving them).

So far, the setting I proposed is pretty crude, as the only matching option is a prefix. It works for Kali, as all I want to do is to match request paths that start with /dists/, and allow files under this prefix to be outdated.

Now, how would that go for the XCP-ng repo, what outdated files do you need to match? I had a quick look, it seems like we could match /repodata/ anywhere in the request path. Or, be stricter, and match the files repomd.xml and repomd.xml.asc. Or maybe repomd.xml.*$, trying to future-proof a bit. What do you prefer? Are those the only metadata files to match, or are there others?

Hi! Sorry for the late reply. So, as I understand it, the problem is that most filenames contain unique identifiers in repodata, so serving an old version of repomd.xml wouldn't solve anything: it would refer to the old filenames, which mirrorbits doesn't know about anymore. That's why I suggested remembering old files for a while. Not RPMs: I agree with other distro maintainers, keeping old RPMs is the distro's responsibility, and we do keep all updates we released in the repositories.

See the current contents of one of the repodata directories:

0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c-primary.sqlite.bz2
09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37-other.sqlite.bz2
ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443-filelists.sqlite.bz2
e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4-primary.xml.gz
f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64-filelists.xml.gz
fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8-other.xml.gz
repomd.xml
repomd.xml.asc

And the contents of repomd.xml which references them:

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1709307305</revision>
  <data type="primary">
    <checksum type="sha256">e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4</checksum>
    <open-checksum type="sha256">8e7d4d3311b470f54c43ec1958decfe9760d413abf2e76d1093775b2a117b7f5</open-checksum>
    <location href="repodata/e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4-primary.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>2114338</size>
    <open-size>14535196</open-size>
  </data>
  <data type="filelists">
    <checksum type="sha256">f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64</checksum>
    <open-checksum type="sha256">95b07d6edbf1e3e2261095112f20f75057257e447f360a523f0e7ec6180821ce</open-checksum>
    <location href="repodata/f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64-filelists.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>7190267</size>
    <open-size>100075554</open-size>
  </data>
  <data type="other">
    <checksum type="sha256">fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8</checksum>
    <open-checksum type="sha256">d5b576c83da54d59e9d001cd5f2d1fa4c68afcf9081bbe445cece9b3ff45828d</open-checksum>
    <location href="repodata/fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8-other.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>1071736</size>
    <open-size>9827266</open-size>
  </data>
  <data type="primary_db">
    <checksum type="sha256">0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c</checksum>
    <open-checksum type="sha256">f821fe8f705479bfcc4dfaa61e2b5ac97da56835a0bcbbbc573e825060596992</open-checksum>
    <location href="repodata/0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c-primary.sqlite.bz2"/>
    <timestamp>1709307302</timestamp>
    <size>3228605</size>
    <open-size>16738304</open-size>
    <database_version>10</database_version>
  </data>
  <data type="filelists_db">
    <checksum type="sha256">ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443</checksum>
    <open-checksum type="sha256">cc7eef41a3dc1cafad950c57345784f8ffe7fae6f2118e5a28638885cae1e818</open-checksum>
    <location href="repodata/ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443-filelists.sqlite.bz2"/>
    <timestamp>1709307305</timestamp>
    <size>7294996</size>
    <open-size>43271168</open-size>
    <database_version>10</database_version>
  </data>
  <data type="other_db">
    <checksum type="sha256">09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37</checksum>
    <open-checksum type="sha256">8ed8afa59eef48afd6e761d9d8f67d46a024ff788c85b23726cb2d447188684b</open-checksum>
    <location href="repodata/09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37-other.sqlite.bz2"/>
    <timestamp>1709307302</timestamp>
    <size>1255313</size>
    <open-size>9761792</open-size>
    <database_version>10</database_version>
  </data>
</repomd>

The next time we regenerate the metadata, filenames will change.
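To see which files a given repomd.xml ties a client to, one can extract the location entries. This is a minimal sketch using Go's encoding/xml; the struct and function names are made up, and the sample is a trimmed copy of the file above with a single data entry.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// repomd mirrors only the parts of repomd.xml needed here.
type repomd struct {
	Revision string `xml:"revision"`
	Data     []struct {
		Type     string `xml:"type,attr"`
		Location struct {
			Href string `xml:"href,attr"`
		} `xml:"location"`
	} `xml:"data"`
}

// ReferencedFiles returns the href of every <location> listed in repomd.xml.
func ReferencedFiles(doc []byte) ([]string, error) {
	var md repomd
	if err := xml.Unmarshal(doc, &md); err != nil {
		return nil, err
	}
	hrefs := make([]string, 0, len(md.Data))
	for _, d := range md.Data {
		hrefs = append(hrefs, d.Location.Href)
	}
	return hrefs, nil
}

// sample is a trimmed version of the repomd.xml shown above.
const sample = `<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <revision>1709307305</revision>
  <data type="primary">
    <location href="repodata/e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4-primary.xml.gz"/>
  </data>
</repomd>`

func main() {
	files, err := ReferencedFiles([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Since every href embeds a checksum, an old repomd.xml points at names that disappear from the reference repository as soon as the metadata is regenerated.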

elboulangero commented 3 months ago

If I understand correctly: yum (or is it dnf?) downloads the repomd.xml first, and then it might download the other files listed in repomd.xml? Question is: does it hit the redirector again to download those files?

I ask for comparison with apt. Here's how it works for apt update: it first downloads the Release file, and then it downloads some other files listed in the Release file. The key thing is: apt doesn't hit the redirector again for those files, it requests them from the same mirror that served the Release file. In other words: during an apt update transaction, all metadata files are downloaded from the same mirror.

stormi commented 3 months ago

If I understand correctly: yum (or is it dnf?) downloads the repomd.xml first, and then it might download the other files listed in repomd.xml? Question is: does it hit the redirector again to download those files?

I think it does hit the redirector again, because it is not aware there is any redirector at all, with mirrorbits. This is the big difference with other mirror management software that distros may use, be it with yum/dnf, apt or other, and is the very reason why I opened this issue: mirrorbits doesn't give you a mirror URL that you can then use for subsequent requests. It redirects every single request directly, via HTTP headers, in an attempt to 1. balance load better, file by file, and 2. always redirect to a mirror which has the right version of the requested file (as I understand the motives). A given mirror might be eligible for some files but not for others, because it only partially synced, or has some outdated files. Mirrorbits may then redirect you to the partial mirror, closer to your location, for some files, and to other mirrors for the rest.
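As a rough illustration of that per-request behaviour (not mirrorbits' actual selection code), here is a toy redirector that decides file by file. The `Mirror` type, the host names, and the fallback URL are all invented for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Mirror is a hypothetical record of which files a mirror holds in the
// current version.
type Mirror struct {
	BaseURL string
	Files   map[string]bool
}

// PickMirror returns the URL of the first mirror holding the file.
// The slice is assumed to be sorted by proximity to the client.
func PickMirror(mirrors []Mirror, path string) (string, bool) {
	for _, m := range mirrors {
		if m.Files[path] {
			return m.BaseURL + path, true
		}
	}
	return "", false
}

// Redirector answers every request with its own 302, so two files requested
// by the same client may land on two different mirrors.
func Redirector(mirrors []Mirror, fallback string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target, ok := PickMirror(mirrors, r.URL.Path)
		if !ok {
			target = fallback + r.URL.Path // no eligible mirror: fallback mode
		}
		http.Redirect(w, r, target, http.StatusFound)
	})
}

func main() {
	mirrors := []Mirror{
		{BaseURL: "http://near.example", Files: map[string]bool{"/a.rpm": true}},
		{BaseURL: "http://far.example", Files: map[string]bool{"/a.rpm": true, "/repodata/repomd.xml": true}},
	}
	h := Redirector(mirrors, "http://origin.example")
	for _, p := range []string{"/a.rpm", "/repodata/repomd.xml", "/new.rpm"} {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest("GET", p, nil))
		fmt.Println(p, "->", rec.Code, rec.Header().Get("Location"))
	}
}
```

In this toy setup the partially synced nearby mirror serves the package, while the metadata file goes to the farther mirror and an unknown file goes to the fallback.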

Now maybe I'm wrong and there's some logic in dnf that detects there was an HTTP redirection and then bypasses the very URL that we asked it to download from (mirrorbits), but I doubt it. Are you sure apt wouldn't do the same in a similar situation?

elboulangero commented 2 months ago

Sorry for being late, I missed your reply.

Are you sure apt wouldn't do the same in a similar situation?

100% sure, let me detail.

First, we can easily log the requests that are sent by apt. So here's an apt update transaction that is sent to mirrorbits. I filtered the output a bit for clarity:

┌──(root㉿carbon)-[/work/tmp]
└─# apt -y -q -o Debug::Acquire::http=true update 2>&1 | grep -E '^(GET|Host:|Answer|HTTP)'
GET /kali/dists/kali-rolling/InRelease HTTP/1.1
Host: http.kali.org
Answer for: http://http.kali.org/kali/dists/kali-rolling/InRelease
HTTP/1.1 302 Found

GET /kali/dists/kali-rolling/InRelease HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/InRelease
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/main/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/main/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/non-free/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/non-free/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/non-free-firmware/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/non-free-firmware/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/contrib/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/contrib/binary-amd64/Packages.gz
HTTP/1.1 200 OK

To translate that to words:

  1. Get InRelease from http.kali.org (aka. mirrorbits)
  2. Mirrorbits returns a 302 to mirror kali.cs.nycu.edu.tw
  3. Get InRelease from kali.cs.nycu.edu.tw
  4. Then get 4 Packages.gz files that are referenced in the InRelease file, straight from kali.cs.nycu.edu.tw. Not hitting mirrorbits.
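
The effect of steps 3 and 4 can be sketched as resolving the follow-up files against the URL that finally served InRelease, instead of the redirector URL. This is not apt's implementation, just an illustration of the observable behaviour; `ResolveOnMirror` is a made-up name.

```go
package main

import (
	"fmt"
	"net/url"
)

// ResolveOnMirror builds the URL of a file listed in a metadata index,
// relative to the URL the index was finally served from (after redirects).
func ResolveOnMirror(finalIndexURL, relPath string) (string, error) {
	base, err := url.Parse(finalIndexURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(relPath)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// URL that answered 200 OK for InRelease in the log above.
	final := "http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/InRelease"

	for _, rel := range []string{
		"main/binary-amd64/Packages.gz",
		"contrib/binary-amd64/Packages.gz",
	} {
		u, err := ResolveOnMirror(final, rel)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u)
	}
}
```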

It was implemented in apt in this commit: https://salsa.debian.org/apt-team/apt/-/commit/9b8034a9fd40b4d05075fda719e61f6eb4c45678 (back in 2016)