etix / mirrorbits

Mirrorbits is a geographical download redirector written in Go for distributing files efficiently across a set of mirrors.

Health-Check might check for non-existing files, marking mirror down by mistake. Could be improved #152

Open elboulangero opened 11 months ago

elboulangero commented 11 months ago

Background - how does health-check work

How does the health-check work? It runs every minute by default. Mirrorbits sends an HTTP HEAD request for a random file served by the mirror. If the request succeeds, the mirror is marked as up; otherwise it is marked as down.

To go a bit more in depth: Mirrorbits picks the file from the hash HANDLEDFILES_<mirror-id>. This hash contains the files that are 1) on the mirror and 2) on the local repo (the source, in mirrorbits-speak), so it's the intersection of these two sets. It means that the HANDLEDFILES hash doesn't contain extra files that would be on the mirror but not in the source (if any), and doesn't contain files that are in the source but not on the mirror (if any). So it should be pretty good at picking a suitable file.

The value of HANDLEDFILES_<mirror-id> is updated every time a mirror scan completes.
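To make the mechanism concrete, here is a minimal sketch in Python of the selection described above. It is only an illustration, not mirrorbits' actual Go code: the names `handled_files` and `health_check`, the injected `head` callable, the "2xx means success" rule and the handling of an empty set are all assumptions.

```python
import random
from typing import Callable, Iterable, Optional, Set


def handled_files(source_files: Iterable[str], mirror_files: Iterable[str]) -> Set[str]:
    """Files present both in the source and on the mirror, as of the last scan."""
    return set(source_files) & set(mirror_files)


def health_check(handled: Set[str], head: Callable[[str], int],
                 rng: Optional[random.Random] = None) -> bool:
    """Pick one random handled file, issue a HEAD request, report up (True) or down (False).

    `head` is a hypothetical callable that returns the HTTP status code for a path.
    """
    if not handled:
        # Assumption: with nothing to check, the mirror cannot be confirmed as up.
        return False
    rng = rng or random.Random()
    # sorted() turns the set into a sequence and gives a stable order when seeded.
    path = rng.choice(sorted(handled))
    # Assumption: any 2xx status counts as a successful request.
    return 200 <= head(path) < 300
```

The point to notice is that `handled` is computed at scan time, while `head` queries the mirror as it is right now. Nothing in the check itself refreshes the set.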

Issue - the theory

The issue lies in the fact that HANDLEDFILES is only refreshed when a mirror scan completes. Every time a mirror is updated, HANDLEDFILES is outdated until the mirror is scanned again. Assuming mirrors are scanned every hour, there's a window of at most one hour during which HANDLEDFILES contains files that might not be on the mirror anymore. If the health-check picks one of those files, the mirror returns 404 and is marked as down. If the health-check picks a file that is still on the mirror, all good, the mirror is up. Assuming the health-check runs every minute, we have a one-hour window during which mirrors might appear "flaky" and go up and down every minute.

Issue - in practice

Is it really an issue? Well, it depends on how many files disappear when the repo is updated, compared to the total number of files.

Let's look at the Kali Linux images, in numbers:

# cd /srv/mirrors/kali-images
# find -type f | wc -l
174
# find kali-weekly/ | wc -l
89
# find kali-weekly/ | grep -- -W41- | wc -l
42
# find kali-weekly/ | grep -- -W42- | wc -l
42

To say it in words:

Once a week, this repo is updated, a new weekly image is added, and the old weekly image is removed. Meaning: once a week, when the repo is updated, 25% of the files in the repo disappear.

For the health-check, it means that, during a one hour window, it has 1 chance out of 4 to pick a file that is not on the mirror anymore, and to mark the mirror down.
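To put rough numbers on this, here is a back-of-the-envelope calculation in Python. It assumes that each check picks a file uniformly at random and independently of the previous checks, and that 42 of the 174 handled files are gone (taken from the counts above). The function names are made up for the illustration.

```python
def p_stale_pick(removed: int, total: int) -> float:
    """Chance that a single health-check picks a file that is gone."""
    return removed / total


def p_down_at_least_once(removed: int, total: int, checks: int) -> float:
    """Chance that at least one of `checks` independent checks hits a removed file."""
    return 1.0 - (1.0 - p_stale_pick(removed, total)) ** checks


p_single = p_stale_pick(42, 174)            # about 0.24, i.e. roughly 1 in 4
p_hour = p_down_at_least_once(42, 174, 60)  # 60 checks in a one-hour window
print(f"single check: {p_single:.2f}, at least one failure in 60 checks: {p_hour:.6f}")
```

Under these assumptions, a mirror is practically certain to be marked down at least once during the window, which is consistent with the up/down flapping described here.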

So, once a week, during a one-hour window, the mirrors seem to be flaky, and go up and down from mirrorbits' point of view. We can see it in this graph, which shows around 10 days of data and checks the availability of an image in the repo. We can clearly see the two moments when the repo was updated with a new weekly image, causing mirrors to be marked up/down by mirrorbits.

[Graph: installer-2-weeks]

Mitigation and possible improvements

The easy mitigation for a mirrorbits user is just to reduce the scan interval (e.g. to 30 minutes). It works for the Kali Linux images: there are only 174 files in this repo, so scanning is quick and a shorter scan interval costs little.

I think this issue could be mitigated in mirrorbits as well, here are a few ideas:

jbkempf commented 10 months ago

I think it is a good idea, yes.

elboulangero commented 10 months ago

Which one? The 404 counter?

lazka commented 6 months ago

imo limiting it to a single file is good enough (similar to TraceFileLocation config wise, even the same file could be used)

Another heuristic would be to check the "newest" file each mirror has according to the last scan, assuming only old files get removed. But that would fail if files constantly get added and removed again.
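A minimal sketch of that "newest file" pick, in Python. The shape of the scan result (a mapping from path to modification time) and the name `newest_file` are assumptions for illustration only, not mirrorbits' data model.

```python
from typing import Mapping, Optional


def newest_file(scanned: Mapping[str, float]) -> Optional[str]:
    """Return the path with the latest modification time seen in the last scan.

    `scanned` maps path -> modification time in seconds since the epoch
    (hypothetical structure, for illustration).
    """
    if not scanned:
        return None
    # Break ties on the path so the choice is deterministic.
    return max(scanned, key=lambda path: (scanned[path], path))
```

For example, `newest_file({"old.iso": 100.0, "new.iso": 200.0})` picks `"new.iso"`. As noted, this only helps if the newest file is also the one least likely to be removed before the next scan.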