Closed adrianreber closed 8 years ago
If this specific commit broke something then we have a larger problem as before this commit we were trying to use a non-existing dict :-/
I have added some debug output to the function try_per_file()
which right now has trouble scanning the mirrors. try_per_file()
gets as parameter d
a Directory
object and loops over the elements of the directory to get all files from the mirrors. There are two different objects in Directory
which give information about the included directories. There is d.files
and d.fileDetails
.
d.files
is a dict and we used to loop over the keys of the dict. This actually contains all (or the newest files I think) of the current directory. Your commit changed the loop to loop over d.fileDetails
which only contains files for which we store mdsums and shasums. This leads to not looping over files without *sums and thus not scanning any other directories than repodata directories.
For comparison here is a print
of d.files
for a non-repodata directory:
{'globus-xio-debuginfo-0-3.2-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '3711'}, 'header.src.info': {'stat': 1326741540, 'size': '0'}, 'globus-xio-gsi-driver-debuginfo-0-2.1-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '2096'}, 'fex-debuginfo-0-1.20100416.2814-2.el4.ppc.hdr': {'stat': 1326669624, 'size': '1514'}, 'libidn2-debuginfo-0-0.8-1.el4.ppc.hdr': {'stat': 1326669624, 'size': '4387'}, 'globus-xio-popen-driver-debuginfo-0-2.2-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '2014'}, 'header.info': {'stat': 1326741540, 'size': '5326'}, 'nwipe-debuginfo-0-0.06-2.el4.ppc.hdr': {'stat': 1326741540, 'size': '2274'}, 't1lib-debuginfo-0-5.0.2-2.ppc.hdr': {'stat': 1326268238, 'size': '1886'}, 'globus-xio-pipe-driver-debuginfo-0-2.1-1.el4.ppc.hdr': {'stat': 1326268238, 'size': '1591'}}
and d.filesDetails
:
[]
the same for a repodata directory:
`{'20050ef902df2efc7a43e025b7ce4042a802eb31-primary.xml.gz': {'stat': 1330563141, 'size': '7222'}, 'updateinfo.xml.gz': {'stat': 1330563141, 'size': '10536'}, 'd4cfd0a2db6e1cae4d666ebe5063d0635bcaaff3-filelists.xml.gz': {'stat': 1330563141, 'size': '1883'}, '61f86b6cf57723478c8900e9555dce5fb0073b92-primary.sqlite.bz2': {'stat': 1330563141, 'size': '11118'}, 'a52a2bf6a8cffeb37c895f5171f5c6f815ebc327-other.xml.gz': {'stat': 1330563141, 'size': '9797'}, '4fd7be747395e68e5f50a032feb9b7960f31ad17-filelists.sqlite.bz2': {'stat': 1330563141, 'size': '3318'}, 'repomd.xml': {'stat': 1330563141, 'size': '2918'}, '736468f6f034c65b5e418206d02f638bcb5f6eec-other.sqlite.bz2': {'stat': 1330563141, 'size': '13287'}}``
and d.fileDetails
:
[<mirrormanager2.lib.model.FileDetail object at 0x2a831450>]
So the original code which looped over the keys of d.files
was doing the right thing (for most of the cases).
This means d.files
isn't the result of a DB mapping but an attribute that is set elsewhere (not my favorite design tbh).
Sounds like you're right and we should revert the commit mentioned above, problem is: I specifically remember that commit fixing a problem we were having (just can't remember which one it was).
It was a hack because millions of directory entries times tens of hundreds of files was way too slow to query. Hence the hacks too to keep only a few most recently updated files in the directories with lots of RPMs and .html files. On Sep 18, 2015 7:40 AM, "Pierre-Yves Chibon" notifications@github.com wrote:
This means d.files isn't the result of a DB mapping but an attribute that is set elsewhere (not my favorite design tbh).
Sounds like you're right and we should revert the commit mentioned above, problem is: I specifically remember that commit fixing a problem we were having (just can't remember which one it was).
— Reply to this email directly or view it on GitHub https://github.com/fedora-infra/mirrormanager2/issues/131#issuecomment-141440603 .
On Fri, Sep 18, 2015 at 06:39:43AM -0700, Matt Domsch wrote:
It was a hack because millions of directory entries times tens of hundreds of files was way too slow to query. Hence the hacks too to keep only a few most recently updated files in the directories with lots of RPMs and .html files.
I wonder if we could move the hack to be a @property
of the object.
Food for thoughts :)
I had a look at the code and the d.files
pickle is filled in by umdl and read by the crawler.
Unfortunately I also do not remember why you changed the code to be like it is now. I remember that it crashed in some cases but not why. So we need to revert the commit and probably add a check if d.files
is empty so that we cannot loop over it.
FTR, the original PR: https://github.com/fedora-infra/mirrormanager2/pull/107
Thanks for finding the original PR. I have not reverted the commits from that PR but changed the code so that it now uses the dict again. I will update this issue tomorrow after having tested my changes.
@adrianreber if you changes are working, don't forget to make a PR so that we can make a release :)
This has been fixed in #138
@pypingou your commit ced912addae7e407536a4696a6d9676eb0135d1e broke HTTP crawling. Most directories are now skipped. I have seen this now on multiple hosts but was only able to track it down today. This needs to be reverted and the problem which this tried to fix needs to be fixed in some other way.
An up to date host which does only provide HTTP for scanning has only the following directories listed as up to date. Only repodata directories and no directories with actual content.