fedora-infra / mirrormanager2

Rewrite of the MirrorManager application in Flask and SQLAlchemy
https://mirrormanager.fedoraproject.org
GNU General Public License v2.0

umdl: add fullfiletimelist-* based master scanning #207

Closed adrianreber closed 7 years ago

adrianreber commented 7 years ago

Disk-based scanning of all modules of the Fedora master mirror takes hours: a full scan takes more than 6 hours and stat()s every directory in the whole tree. This generates a massive number of IOPS, most of which only serve to detect that a directory has not changed.

With the newly created fullfiletimelist-* files from quick-fedora-mirror (https://pagure.io/quick-fedora-mirror/), this can change: a nearly stat()-less master mirror scan is possible, and this commit implements it.

Keep in mind that the result can only ever be as good as the input files (fullfiletimelist-*).
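To make the idea concrete, here is a minimal sketch of reading such a file instead of walking the tree. It is not the code from this commit. It assumes the list has bracketed section headers such as `[Files]` and that each entry in that section is a tab-separated `mtime`, `type`, `size`, `path` record; the function name `parse_files_section` is hypothetical.

```python
import posixpath


def parse_files_section(lines):
    """Group the entries of the (assumed) [Files] section by directory.

    Returns {directory: {filename: (mtime, size)}} without touching the disk.
    """
    in_files = False
    dirs = {}
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: sections are introduced by a bracketed header line.
        if line.startswith("[") and line.endswith("]"):
            in_files = line == "[Files]"
            continue
        if not in_files or not line:
            continue
        # Assumption: mtime<TAB>type<TAB>size<TAB>path
        fields = line.split("\t", 3)
        if len(fields) != 4:
            continue
        mtime, ftype, size, path = fields
        if ftype.startswith("d"):
            dirs.setdefault(path, {})
        elif ftype.startswith("f"):
            try:
                entry = (int(mtime), int(size))
            except ValueError:
                continue  # skip malformed records instead of aborting the scan
            parent = posixpath.dirname(path)
            dirs.setdefault(parent, {})[posixpath.basename(path)] = entry
    return dirs
```

Everything the scan learns this way comes from the list, so a stale or incomplete list yields a stale or incomplete result.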

With this commit, umdl detects whether a 'fullfiletimelist-*' file exists and, instead of walking the directory tree, parses only that file. To switch back to disk-based scanning, a new switch is introduced:

--skip-fullfiletimelist Do not look for a fullfiletimelist-*; actually scan the filesystem
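The selection logic described above could look roughly like the following sketch. Only the `--skip-fullfiletimelist` switch and its help text come from this PR; the helper names (`build_parser`, `find_fullfiletimelist`, `choose_scan_mode`) and the "first match wins" rule are assumptions made for illustration.

```python
import argparse
import glob
import os


def build_parser():
    parser = argparse.ArgumentParser(prog="umdl-sketch")
    parser.add_argument(
        "--skip-fullfiletimelist",
        action="store_true",
        default=False,
        help="Do not look for a fullfiletimelist-*; actually scan the filesystem",
    )
    return parser


def find_fullfiletimelist(topdir):
    """Return the first fullfiletimelist-* file in topdir, or None."""
    matches = sorted(glob.glob(os.path.join(topdir, "fullfiletimelist-*")))
    return matches[0] if matches else None


def choose_scan_mode(topdir, skip_fullfiletimelist):
    """Prefer the file list; fall back to a disk scan if it is absent or skipped."""
    if not skip_fullfiletimelist:
        path = find_fullfiletimelist(topdir)
        if path is not None:
            return ("filelist", path)
    return ("disk", topdir)
```

With this shape, a category without a file list silently keeps the old behaviour, and the switch forces that behaviour everywhere.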

Using the fullfiletimelist-* files umdl is much faster:

So it is much faster, but it depends heavily on the correctness of the fullfiletimelist-* files. Currently, a single stat() per changed directory is still necessary to detect whether the directory is readable (mainly for the pre-bitflip scenario). A feature request for quick-fedora-mirror to include this information already exists: https://pagure.io/quick-fedora-mirror/issue/40
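The remaining per-directory stat() can be sketched as follows. This is an illustration, not the check used in umdl: it assumes "readable" means that the directory has both the read and execute bits set for "other", and the function name `is_world_readable_dir` is hypothetical.

```python
import os
import stat


def is_world_readable_dir(path):
    """Return True if path is a directory that "other" can list and enter."""
    try:
        st = os.stat(path)  # the single stat() spent on a changed directory
    except OSError:
        return False  # missing or inaccessible: treat as not readable
    needed = stat.S_IROTH | stat.S_IXOTH
    return stat.S_ISDIR(st.st_mode) and (st.st_mode & needed) == needed
```

Once the file list carries this information itself, the call could be dropped and the scan would need no stat() for directories at all.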

Disk access is still required to read the *-CHECKSUM files for information about the ISOs (these could also be downloaded via https or rsync), and for the checksums of the repomd.xml files used in the metalinks.

One thing that made this more complicated than necessary is the 'Fedora Linux' category. The topdir of every other category points to /srv/pub/ . Only 'Fedora Linux' points to /srv/pub//linux -> /srv/pub/fedora/linux .

The fullfiletimelist-fedora, however, starts at /fedora, so the paths of this single category cannot be joined as easily as those of the other categories. This makes the code unnecessarily complicated in some places.
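One way to picture the mismatch is a helper that maps a path from the file list to a path relative to the category topdir. This is a sketch under assumptions, not the code in this commit: the function `to_category_path`, the example directory `/srv/pub/epel`, and the exact layout of the lists are hypothetical.

```python
import posixpath


def to_category_path(list_root, category_topdir, listed_path):
    """Map a path from a file list to a path relative to the category topdir.

    list_root is the directory the file list's paths are relative to.
    Returns None for entries that lie outside the category.
    """
    absolute = posixpath.normpath(posixpath.join(list_root, listed_path))
    topdir = posixpath.normpath(category_topdir)
    if absolute == topdir:
        return ""
    prefix = topdir + "/"
    if not absolute.startswith(prefix):
        return None
    return absolute[len(prefix):]


# Usual case: the list starts exactly at the category topdir.
print(to_category_path("/srv/pub/epel", "/srv/pub/epel", "7/x86_64"))
# 'Fedora Linux': the list starts one level above the topdir.
print(to_category_path("/srv/pub/fedora", "/srv/pub/fedora/linux", "linux/releases/26"))
```

For the usual case the mapping is a no-op; only the category whose list starts above its topdir needs the leading component stripped and unrelated entries filtered out.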

adrianreber commented 7 years ago

#206

adrianreber commented 7 years ago

I know this is difficult to review, but it runs successfully in staging and I would like to get it deployed in prod. I am hoping for a review so that it can be merged soon.

dustymabe commented 7 years ago

Anyone else able to review this?

adrianreber commented 7 years ago

Thanks for all the reviews. I tried to address all of the comments I was able to.

I updated the PR and reran the tests in staging.

adrianreber commented 7 years ago

Any other reviews, comments? If not, I would like to merge and make a new release.

jeremycline commented 7 years ago

I think at this point it's fine to merge it 👍