fedora-infra / mirrormanager2

Rewrite of the MirrorManager application in Flask and SQLAlchemy
https://mirrormanager.fedoraproject.org
GNU General Public License v2.0
63 stars 48 forks source link

HTTP crawl skips most files #131

Closed adrianreber closed 8 years ago

adrianreber commented 8 years ago

@pypingou your commit ced912addae7e407536a4696a6d9676eb0135d1e broke HTTP crawling. Most directories are now skipped. I have seen this now on multiple hosts but was only able to track it down today. This needs to be reverted and the problem which this tried to fix needs to be fixed in some other way.

An up to date host which does only provide HTTP for scanning has only the following directories listed as up to date. Only repodata directories and no directories with actual content.

4/i386/debug/repodata
4/i386/repodata
4/ppc/debug/repodata
4/ppc/repodata
4/SRPMS/repodata
4/x86_64/debug/repodata
4/x86_64/repodata
5/i386/debug/repodata
5/i386/repodata
5/ppc/debug/repodata
5/ppc/repodata
5/SRPMS/repodata
5/x86_64/debug/repodata
5/x86_64/repodata
6/i386/debug/repodata
6/i386/repodata
6/ppc64/debug/repodata
6/ppc64/repodata
6/SRPMS/repodata
6/x86_64/debug/repodata
6/x86_64/repodata
7/ppc64/debug/repodata
7/ppc64/repodata
7/SRPMS/repodata
7/x86_64/debug/repodata
7/x86_64/repodata
testing/4/i386/debug/repodata
testing/4/i386/repodata
testing/4/ppc/debug/repodata
testing/4/ppc/repodata
testing/4/SRPMS/repodata
testing/4/x86_64/debug/repodata
testing/4/x86_64/repodata
testing/5/i386/debug/repodata
testing/5/i386/repodata
testing/5/ppc/repodata
testing/5/SRPMS/repodata
testing/5/x86_64/debug/repodata
testing/5/x86_64/repodata
testing/6/i386/debug/repodata
testing/6/i386/repodata
testing/6/ppc64/debug/repodata
testing/6/ppc64/repodata
testing/6/SRPMS/repodata
testing/6/x86_64/debug/repodata
testing/6/x86_64/repodata
testing/7/ppc64/debug/repodata
testing/7/x86_64/debug/repodata
testing/7/x86_64/repodata
pypingou commented 8 years ago

If this specific commit broke something then we have a larger problem as before this commit we were trying to use a non-existing dict :-/

adrianreber commented 8 years ago

I have added some debug output to the function try_per_file() which right now has trouble scanning the mirrors. try_per_file() gets as parameter d a Directory object and loops over the elements of the directory to get all files from the mirrors. There are two different objects in Directory which give information about the included directories. There is d.files and d.fileDetails.

d.files is a dict and we used to loop over the keys of the dict. This actually contains all (or the newest files I think) of the current directory. Your commit changed the loop to loop over d.fileDetails which only contains files for which we store mdsums and shasums. This leads to not looping over files without *sums and thus not scanning any other directories than repodata directories.

For comparison here is a print of d.files for a non-repodata directory:

{'globus-xio-debuginfo-0-3.2-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '3711'}, 'header.src.info': {'stat': 1326741540, 'size': '0'}, 'globus-xio-gsi-driver-debuginfo-0-2.1-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '2096'}, 'fex-debuginfo-0-1.20100416.2814-2.el4.ppc.hdr': {'stat': 1326669624, 'size': '1514'}, 'libidn2-debuginfo-0-0.8-1.el4.ppc.hdr': {'stat': 1326669624, 'size': '4387'}, 'globus-xio-popen-driver-debuginfo-0-2.2-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '2014'}, 'header.info': {'stat': 1326741540, 'size': '5326'}, 'nwipe-debuginfo-0-0.06-2.el4.ppc.hdr': {'stat': 1326741540, 'size': '2274'}, 't1lib-debuginfo-0-5.0.2-2.ppc.hdr': {'stat': 1326268238, 'size': '1886'}, 'globus-xio-pipe-driver-debuginfo-0-2.1-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '1591'}}

and d.filesDetails:

[]

the same for a repodata directory:

`{'20050ef902df2efc7a43e025b7ce4042a802eb31-primary.xml.gz': {'stat': 1330563141, 'size': '7222'}, 'updateinfo.xml.gz': {'stat': 1330563141, 'size': '10536'}, 'd4cfd0a2db6e1cae4d666ebe5063d0635bcaaff3-filelists.xml.gz': {'stat': 1330563141, 'size': '1883'}, '61f86b6cf57723478c8900e9555dce5fb0073b92-primary.sqlite.bz2': {'stat': 1330563141, 'size': '11118'}, 'a52a2bf6a8cffeb37c895f5171f5c6f815ebc327-other.xml.gz': {'stat': 1330563141, 'size': '9797'}, '4fd7be747395e68e5f50a032feb9b7960f31ad17-filelists.sqlite.bz2': {'stat': 1330563141, 'size': '3318'}, 'repomd.xml': {'stat': 1330563141, 'size': '2918'}, '736468f6f034c65b5e418206d02f638bcb5f6eec-other.sqlite.bz2': {'stat': 1330563141, 'size': '13287'}}``

and d.fileDetails:

[<mirrormanager2.lib.model.FileDetail object at 0x2a831450>]

So the original code which looped over the keys of d.files was doing the right thing (for most of the cases).

pypingou commented 8 years ago

This means d.files isn't the result of a DB mapping but an attribute that is set elsewhere (not my favorite design tbh).

Sounds like you're right and we should revert the commit mentioned above, problem is: I specifically remember that commit fixing a problem we were having (just can't remember which one it was).

mdomsch commented 8 years ago

It was a hack because millions of directory entries times tens of hundreds of files was way too slow to query. Hence the hacks too to keep only a few most recently updated files in the directories with lots of RPMs and .html files. On Sep 18, 2015 7:40 AM, "Pierre-Yves Chibon" notifications@github.com wrote:

This means d.files isn't the result of a DB mapping but an attribute that is set elsewhere (not my favorite design tbh).

Sounds like you're right and we should revert the commit mentioned above, problem is: I specifically remember that commit fixing a problem we were having (just can't remember which one it was).

— Reply to this email directly or view it on GitHub https://github.com/fedora-infra/mirrormanager2/issues/131#issuecomment-141440603 .

pypingou commented 8 years ago

On Fri, Sep 18, 2015 at 06:39:43AM -0700, Matt Domsch wrote:

It was a hack because millions of directory entries times tens of hundreds of files was way too slow to query. Hence the hacks too to keep only a few most recently updated files in the directories with lots of RPMs and .html files.

I wonder if we could move the hack to be a @property of the object.

Food for thoughts :)

adrianreber commented 8 years ago

I had a look at the code and the d.files pickle is filled in by umdl and read by the crawler.

Unfortunately I also do not remember why you changed the code to be like it is now. I remember that it crashed in some cases but not why. So we need to revert the commit and probably add a check if d.files is empty so that we cannot loop over it.

pypingou commented 8 years ago

FTR, the original PR: https://github.com/fedora-infra/mirrormanager2/pull/107

adrianreber commented 8 years ago

Thanks for finding the original PR. I have not reverted the commits from that PR but changed the code so that it now uses the dict again. I will update this issue tomorrow after having tested my changes.

pypingou commented 8 years ago

@adrianreber if you changes are working, don't forget to make a PR so that we can make a release :)

adrianreber commented 8 years ago

This has been fixed in #138