cpan-testers / cpantesters-backend

Backend data processing for CPAN Testers

Incomplete uploads database table after MetaCPAN outages #27

Open eserte opened 7 months ago

eserte commented 7 months ago

Looking at https://github.com/cpan-testers/cpantesters-backend/blob/master/lib/CPAN/Testers/Backend/FetchUploads.pm and https://github.com/cpan-testers/cpantesters-backend/blob/e8803f410d8dc9cd75ec0af42f8c754d8f787ba4/Rexfile#L126-L135 it seems that freshly uploaded CPAN releases are missing permanently in the database if the MetaCPAN API is down or unreachable for about 20 to 30 minutes. Unfortunately this seems to happen now and then, see also https://github.com/metacpan/metacpan-web/issues/2992 for a recent incident.

This in turn means that test reports for these missing CPAN releases are permanently lost, even if coming after the MetaCPAN outage. A prominent example is Net-SSLeay 1.94, which was released about six weeks ago, and still does not have any test reports listed, see http://matrix.cpantesters.org/?dist=Net-SSLeay (the website is currently showing "NOTE: no report for latest version 1.94").

So what can be done? As a quick fix, I think it would be good to fill the missing bits in the database. Probably running beam run metacpan fetch_uploads without the --since option could help. Maybe try first with increasing intervals (I don't know how the MetaCPAN API behaves if everything without filter is fetched). For the period starting from last Friday I would expect that about 70-80 entries would be added.

What to do as a long-term fix? I am not sure. Outages or network problems of all kinds can always happen. Maybe it would help if the --since period were permanently increased (to one day? more?), but this would add more load to the MetaCPAN API and the local database. Maybe there could be a rarely running "repair" cronjob which uses a longer --since period. Maybe monitoring could be better (currently it seems that failures to connect to MetaCPAN are not logged at all).
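One alternative to a fixed --since period can be sketched as follows: remember when the last fetch succeeded and look back to that point plus a safety overlap, so the window grows automatically during an outage. This is a minimal Python sketch, not code from cpantesters-backend; the function name `lookback_for_next_run`, the stored "last success" timestamp, and the 30-minute overlap are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def lookback_for_next_run(last_success, now, overlap=timedelta(minutes=30)):
    """Return how far back the next fetch should look.

    last_success is the time of the last fetch that completed without
    error (hypothetical: it would have to be stored somewhere), or None
    if no fetch has ever succeeded.
    """
    if last_success is None:
        # No watermark yet: signal a full, unfiltered fetch.
        return None
    # Cover everything since the last good run, plus a safety overlap.
    return (now - last_success) + overlap


now = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
# Normal operation: the previous run 10 minutes ago succeeded.
print(lookback_for_next_run(now - timedelta(minutes=10), now))  # 0:40:00
# After a 2-hour outage the window widens instead of dropping uploads.
print(lookback_for_next_run(now - timedelta(hours=2), now))     # 2:30:00
```

The watermark would only be advanced after a fetch completes, so a failed run leaves it untouched and the next run covers the gap. Re-fetching overlapping periods only works if inserting an already-known upload is harmless.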

It would also be nice if the possibly existing reports in the database could be repaired by reprocessing them after the uploads table was repaired.

FYI @jkeenan (James: this relates to the post "CPANtesters failing to report distribution name for Net-SSLeay" you wrote some weeks ago) and @andk.

glasswalk3r commented 7 months ago

> it seems that freshly uploaded CPAN releases are missing permanently in the database if the MetaCPAN API is down or unreachable for about 20 to 30 minutes.

Is this not related to the report submission? If the API is down, there is no way the report could be submitted, right? If the tester has some means of keeping the report locally (like using metabase-relayd), the report could be submitted again later.

Or is there a part of the flow that I'm not aware of?

eserte commented 7 months ago

It seems that reports for any distribution which is not listed in the cpantesters database are just ignored. You can check http://metabase.cpantesters.org/tail/log.txt --- there are still about 100 out of 1000 lines which have just a [] where the distribution name should be. These reports are lost, and it does not help to wait and send later. Only inserting the missing distribution to the database would help.

jkeenan commented 6 months ago

> It seems that reports for any distribution which is not listed in the cpantesters database are just ignored. You can check http://metabase.cpantesters.org/tail/log.txt --- there are still about 100 out of 1000 lines which have just a [] where the distribution name should be. These reports are lost, and it does not help to wait and send later. Only inserting the missing distribution to the database would help.

This problem persists. Today I installed perl-5.39.9 and tried to install about 500 CPAN modules against it. I can confirm that Net-SSLeay and MIME-tools are two distributions where reports were generated, but logged at http://metabase.cpantesters.org/tail/log.txt without their distribution names. We have no recent CPANtesters data for recent releases of these two distros. See: http://fast-matrix.cpantesters.org/?dist=MIME-tools and http://fast-matrix.cpantesters.org/?dist=Net-SSLeay.

eserte commented 6 months ago

Any news on this? Is this a topic we can tackle at PTS 2024?

preaction commented 5 months ago

Yes, @eserte, your summation is correct: If the MetaCPAN API is down 3 times in 30 minutes (once every 10 minutes), CPAN Testers will never get that data, and no reports can possibly be submitted for those distributions. And, also yes, I can manually run that job without the --since argument to rebuild that table from zero.

This shouldn't be possible, I think: It's not CPAN Testers's job to know what is uploaded. If we get a report for something, we should accept it, and we can later decide if we want to display it (once we've verified the report is for a module uploaded to CPAN by an authorized account). So, I think instead of failing if an upload record isn't found, I'll just insert a provisional record in that table (which I thought was the current behavior, but clearly not...)
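The "insert a provisional record instead of failing" idea can be illustrated with a small find-or-create routine. This is a Python/sqlite3 sketch under stated assumptions, not the actual cpantesters-backend code: the table layout, the `provisional` column, and the function name `find_or_create_upload` are hypothetical.

```python
import sqlite3

# Hypothetical, simplified schema; the real uploads table differs.
SCHEMA = """
CREATE TABLE uploads (
    id          INTEGER PRIMARY KEY,
    dist        TEXT NOT NULL,
    version     TEXT NOT NULL,
    provisional INTEGER NOT NULL DEFAULT 0
)
"""


def find_or_create_upload(conn, dist, version):
    """Return (upload_id, is_provisional) for a dist/version pair.

    If no upload record exists, insert a provisional one rather than
    rejecting the report; it can be verified against CPAN later.
    """
    row = conn.execute(
        "SELECT id, provisional FROM uploads WHERE dist = ? AND version = ?",
        (dist, version),
    ).fetchone()
    if row is not None:
        return row[0], bool(row[1])
    cur = conn.execute(
        "INSERT INTO uploads (dist, version, provisional) VALUES (?, ?, 1)",
        (dist, version),
    )
    conn.commit()
    return cur.lastrowid, True


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# The upload was never fetched, yet the report is still accepted.
upload_id, provisional = find_or_create_upload(conn, "Net-SSLeay", "1.94")
print(upload_id, provisional)
```

Whether a report attached to a provisional record is displayed would then be a separate decision, made once the upload has been confirmed.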

While investigating this, I also have found that there aren't the UNIQUE constraints I would've expected, so there are multiple records for every dist/version... I'll deal with that presently as well.
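For illustration, de-duplicating such a table and then adding the missing constraint might look like the sketch below. It uses Python with sqlite3 and a hypothetical, simplified `uploads` table; the index name and the "keep the lowest id" rule are assumptions, not a description of the repair scripts actually used.

```python
import sqlite3


def dedupe_uploads(conn):
    """Remove duplicate dist/version rows, then enforce uniqueness.

    Keeps the row with the lowest id in each (dist, version) group and
    returns the number of rows deleted.
    """
    cur = conn.execute(
        """
        DELETE FROM uploads
        WHERE id NOT IN (
            SELECT MIN(id) FROM uploads GROUP BY dist, version
        )
        """
    )
    deleted = cur.rowcount
    # With the duplicates gone, the unique index can be created safely.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS uploads_dist_version "
        "ON uploads (dist, version)"
    )
    conn.commit()
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploads (id INTEGER PRIMARY KEY, dist TEXT, version TEXT)"
)
conn.executemany(
    "INSERT INTO uploads (dist, version) VALUES (?, ?)",
    [("Net-SSLeay", "1.94"), ("Net-SSLeay", "1.94"), ("MIME-tools", "5.515")],
)
print(dedupe_uploads(conn))  # 1
```

Once the unique index exists, a later attempt to insert the same dist/version again fails at the database level instead of silently creating another duplicate.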

jkeenan commented 5 months ago

@preaction, thanks for your investigation. Look forward to the results.

preaction commented 5 months ago

So... Holy mother-forking shirt-balls there are reports in here that have failed to process back to 2019 (which, if I recall correctly, was when I added the code to pull this uploads data from MetaCPAN). My script to repair the uploads data, and my other script to de-duplicate that same data, are running now. Once that is complete, I can put the failed jobs in the queue to run the processor again.

I'm still working on the fix to pre-populate the uploads data if it's missing, but that should be done before the summit is finished.

preaction commented 5 months ago

The 125,000 missing reports are back in the queue (but will probably take quite a bit to chew through). I'm going to finish automated tests for the various fixes to the report processing... process, and then this shouldn't happen again (for this specific reason, at least).

jkeenan commented 5 months ago

So far, so good. I have been able to run reports for Net-SSLeay and MIME-tools against the newly released perl-5.39.10 and have the results reported:

http://fast-matrix.cpantesters.org/?dist=Net-SSLeay;perl=5.39.10;reports=1

http://fast-matrix.cpantesters.org/?dist=MIME-tools;perl=5.39.10;reports=1

preaction commented 5 months ago

Excellent. I've re-prioritized the incoming job queue to put these older reports lower on the queue, so that once again, new reports are processed quickly (to get them on the regular matrix.cpantesters.org). I let a backlog accumulate, though, so we're still behind in processing and likely will be for a couple days.
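The effect of putting the older reports lower on the queue can be shown with a toy priority queue. This Python sketch is only an illustration of the idea; it says nothing about how the real job queue is implemented, and the class name `ReportQueue` and the two priority levels are assumptions.

```python
import heapq
import itertools


class ReportQueue:
    """Toy queue: new reports are served before the repair backlog."""

    NEW = 0      # freshly submitted reports
    BACKLOG = 1  # old reports re-queued after the uploads table repair

    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order per priority.
        self._counter = itertools.count()

    def push(self, report_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), report_id))

    def pop(self):
        # Lowest priority number first, then oldest insertion.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = ReportQueue()
queue.push("old-2019-a", ReportQueue.BACKLOG)
queue.push("old-2019-b", ReportQueue.BACKLOG)
queue.push("new-today", ReportQueue.NEW)
print([queue.pop() for _ in range(len(queue))])
# ['new-today', 'old-2019-a', 'old-2019-b']
```

The backlog is still processed, only after any pending new reports, which is why catching up takes longer while new results appear quickly.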

eserte commented 4 months ago

Things look good now. I think this issue may be closed. What do you think, @preaction?