backdrop-ops / backdropcms.org

Issue tracker for the BackdropCMS.org website
https://backdropcms.org

Collecting download count from GitHub broken #1040

Open indigoxela opened 3 days ago

indigoxela commented 3 days ago

EDIT: the problem is not what I thought initially. More findings in the comments.

It might or might not have to do with some rate limit for API calls.

https://docs.github.com/de/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28

(Authenticated requests have a higher primary rate limit than unauthenticated requests.)

Or it's a server-side problem like memory or timeout...

Previous (wrong) assumption: this is the wrong header:

https://github.com/backdrop-ops/backdropcms.org/blob/main/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L53

Actually the API now requires something like:

```
curl -s -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/backdrop-contrib/REPONAME/releases
```

Note the header: Accept: application/vnd.github.v3+json. That CURLOPT_USERAGENT in that function seems like an odd decision, too. :wink: And I don't get why the Authorization header would be needed (all repos are public, so all repo info is, too).
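
In PHP, the equivalent change would look roughly like this (just a sketch against the linked function, not a tested patch; variable names are illustrative):

```php
<?php
// Sketch only: send the Accept header that GitHub's REST API documents.
// $repo stands in for whatever the module passes to its fetch function.
$ch = curl_init("https://api.github.com/repos/backdrop-contrib/$repo/releases");
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
  'Accept: application/vnd.github.v3+json',
));
// GitHub asks API clients to send a User-Agent, so keep setting one
// (as the existing function already does).
curl_setopt($ch, CURLOPT_USERAGENT, 'backdropcms.org');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($ch);
curl_close($ch);
```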

argiepiano commented 3 days ago

It'd be great if you provided a PR - I can test... except that perhaps this could only be tested on the beta version of backdropcms.org, so we would need someone who has access to patching beta - @jenlampton or @bugfolder ?

jenlampton commented 2 days ago

It can be tested by anyone who has our GitHub key, which we'd be willing to provide to trusted community members for testing purposes ;) DM me and I'll send it next time I'm at my computer :)


argiepiano commented 2 days ago

In fact, to get download counts, there is no need to use a GitHub key. I believe this is now publicly available with the GET request @indigoxela posted above.

However, how would we test the PR on the backdropcms.org site? I'm a bit unclear about this.

indigoxela commented 2 days ago

An odd finding here: some of the numbers displayed on B-org seem correct (rules, backup_migrate, imce...). Others, for projects that have existed for a while, are completely wrong (leaflet).

That suggests that this odd and dated function borg_project_metrics_get_downloads() might still work, and the problem is somewhere else... I can't actually test without the authorization token, but @jenlampton you could. Just to make sure we're not fixing something that isn't broken. :wink:

Stats ARE broken, but possibly not because of borg_project_metrics_get_downloads(). FTR: stats seem broken for all projects created since 2022 (or so).

Maybe there's something wrong with these newer nodes on B-org, or maybe there are just too many projects to still handle them that way (without batches, just in loops, it seems).

indigoxela commented 2 days ago

Some more comparison of actual numbers and numbers displayed on B-org...

It feels sort of random. Some numbers are correct; the majority are not. The newer a project is, the more wrong its numbers are. Projects created in the last few years don't display any download counts at all.

My conclusion: it's the number of projects to loop over. Some timeout or memory exhaustion might strike. Or possibly GitHub's rate limits on API calls kick in. It's hard to know from outside.

argiepiano commented 2 days ago

My conclusion: it's the number of projects to loop over. Some timeout or memory exhaustion might strike

That sounds plausible! Perhaps we could use a queue and Backdrop's Queue API for this, instead of trying to load everything and loop through it at once, risking a timeout.
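
Something along these lines, assuming Backdrop's queue API mirrors Drupal 7's (BackdropQueue::get() plus hook_cron_queue_info()); all names other than the hooks are illustrative:

```php
<?php
/**
 * Implements hook_cron().
 *
 * Sketch: queue one item per project instead of fetching everything in one loop.
 */
function borg_project_metrics_cron() {
  $queue = BackdropQueue::get('borg_project_metrics_downloads');
  // Hypothetical helper returning the node IDs of all project nodes.
  foreach (borg_project_metrics_get_project_nids() as $nid) {
    $queue->createItem($nid);
  }
}

/**
 * Implements hook_cron_queue_info().
 */
function borg_project_metrics_cron_queue_info() {
  return array(
    'borg_project_metrics_downloads' => array(
      // Hypothetical worker: fetch and save the count for a single project node.
      'worker callback' => 'borg_project_metrics_update_download_count',
      // Cap how long each cron run spends on this queue, in seconds.
      'time' => 60,
    ),
  );
}
```

That way each cron run only processes as many projects as fit in the time limit, and the rest wait in the queue.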

GitHub says there is a maximum rate of 5,000 requests per hour. It's hard to know how many nodes b-org currently has for projects, themes, and layouts. I bet there are more than 5,000. Perhaps a cleanup of old nodes is needed, in addition to starting to use a queue.
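
The remaining quota can at least be checked from outside by querying GitHub's /rate_limit endpoint (which, as far as I know, doesn't count against the limit). A quick sketch with a hypothetical helper:

```php
<?php
// Hypothetical helper, not part of the module: ask GitHub how much of the
// hourly quota is left. Pass the site's token (if any) to see the
// authenticated limit instead of the unauthenticated one.
function borg_project_metrics_check_rate_limit($token = NULL) {
  $headers = array('Accept: application/vnd.github.v3+json');
  if ($token) {
    $headers[] = 'Authorization: token ' . $token;
  }
  $ch = curl_init('https://api.github.com/rate_limit');
  curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
  curl_setopt($ch, CURLOPT_USERAGENT, 'backdropcms.org');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  $data = json_decode(curl_exec($ch), TRUE);
  curl_close($ch);
  // Returns something like array('limit' => 5000, 'remaining' => 4321, 'reset' => ...).
  return isset($data['rate']) ? $data['rate'] : NULL;
}
```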

All of this is really hard to test without full access to the b-org site. We need a person with full access to be on board.

docwilmot commented 2 days ago

@bugfolder is the guy we need for this.

I also think borg_project_metrics_cron() could use some watchdog messages for success, not just errors. And I suspect the first try/catch should be wrapping borg_project_metrics_get_downloads(), not $node->save().
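
Roughly like this (untested sketch, just to show the shape; $project_name is a placeholder and the real function's signature may differ):

```php
<?php
// Sketch: wrap the fetch itself, and log success as well as failure.
try {
  $count = borg_project_metrics_get_downloads($project_name);
  watchdog('borg_project_metrics', 'Fetched @count downloads for @project.',
    array('@count' => $count, '@project' => $project_name), WATCHDOG_INFO);
}
catch (Exception $e) {
  watchdog('borg_project_metrics', 'Download fetch failed for @project: @message',
    array('@project' => $project_name, '@message' => $e->getMessage()), WATCHDOG_ERROR);
}
```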

bugfolder commented 1 day ago

Available to test and deploy.

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org, and (b) there are some things you can't handshake with GH from a local dev site (TMK).

argiepiano commented 1 day ago

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org, and (b) there are some things you can't handshake with GH from a local dev site (TMK).

Also, a local instance will not have the possibly thousands of nodes pointing at projects.

bugfolder commented 1 day ago

Also, a local instance will not have the possibly thousands of nodes pointing at projects.

Sure they would. That would be the primary point of building a local version of backdropcms.org (as opposed to a new vanilla installation of B), to have all the nodes, etc. You use the sanitized dbs for backdrop and CiviCRM; the former contains the nodes.

jenlampton commented 1 day ago

That sounds plausible! Perhaps we could use a queue and Backdrop's Queue API for this, instead of trying to load everything and loop through it at once, risking a timeout.

This is a good idea anyway, but I don't think we are hitting the timeout (yet).

There are 1.2k projects in backdrop-contrib, so we can't have more than that many project nodes on b.org (contrib projects include those without releases).

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org

Maybe we should update the README with some information about how to set up a local site? I'll create a separate issue for that :)

there are some things you can't handshake with GH from a local dev site (TMK).

We should document these (maybe also in the README). You may be right, but I'm not sure exactly what they are. AFAIK getting information FROM GitHub should work as expected, but pushing things TO GitHub (like the zipped project) is only possible from b.org. We should do some testing to confirm.

argiepiano commented 1 day ago

There are 1.2k projects in backdrop-contrib, so we can't have more than that many project nodes on b.org (contrib projects include those without releases).

Yes, but perhaps there are more than one node per project? Perhaps one per release?

jenlampton commented 1 day ago

Yes, but perhaps there are more than one node per project? Perhaps one per release?

Those are project release nodes (a different type). I don't know if they get usage data separately? They might!

edit: we have 2974 project release nodes, so still not more than the limit.

indigoxela commented 1 day ago

edit: we have 2974 project release nodes, so still not more than the limit.

That brings us back to the initial question: why is this broken?

It's not working. That's a fact. And it seems to have been broken for quite a while now. From outside we can't figure out what broke. It needs someone to check the logs and possibly debug a bit more to find the culprit. With only the code, and probably even after taking the effort to set up a demo based on this repo, we won't figure it out.

What I can see is that this count fetch happens once a day for all projects in one loop. That seems questionable to me, given the rising number of projects.

That there are "only" 2974 project release nodes does not mean, only 2974 requests will be made to the GH API. There are several fetch jobs and even one job can cause several requests to GH API, if there are many releases (pages). Note also that it seems that, additionally to download counts, there are jobs to fetch commits from core and pulls from CiviCRM (possibly more?).

Here the fetch happens (via cron, all at once): https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L119

This is the actual fetch: https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L43
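
To illustrate why the request count adds up: one project with many releases already means several paged requests to the releases endpoint. A rough sketch, not the module's actual code (names and paging are illustrative):

```php
<?php
// Illustration only: summing download counts for one repo can take several
// API requests, because the releases list is paged.
function example_count_downloads($repo) {
  $total = 0;
  for ($page = 1; ; $page++) {
    $url = "https://api.github.com/repos/backdrop-contrib/$repo/releases?per_page=100&page=$page";
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/vnd.github.v3+json'));
    curl_setopt($ch, CURLOPT_USERAGENT, 'backdropcms.org');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $releases = json_decode(curl_exec($ch), TRUE);
    curl_close($ch);
    if (empty($releases) || !is_array($releases)) {
      break;
    }
    foreach ($releases as $release) {
      foreach ($release['assets'] as $asset) {
        $total += $asset['download_count'];
      }
    }
    // Fewer than a full page means we've reached the last page.
    if (count($releases) < 100) {
      break;
    }
  }
  return $total;
}
```

And that's only the download count job, before the commit and pull-request fetches.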

argiepiano commented 1 day ago

I feel the culprit is a timeout - but as @indigoxela says, it's hard to tell unless someone can install a local version using the current database and config.

bugfolder commented 1 day ago

I've started putting together instructions for setting up a local instance with the current (sanitized) db and config. Should have something in the next few days, will ping here when it's in place.

argiepiano commented 1 day ago

I'm thinking that testing whether this is a timeout or memory issue will be tricky locally, as it may depend on server settings. I know very little about Apache and servers - do they typically create some sort of error log when a process times out or runs out of memory? If so, has anyone checked the logs for b-org?

jenlampton commented 22 hours ago

Timeout and memory issues are in PHP, which has good logging. There are a bunch of PHP errors in the logs on the live site (which should be addressed), but nothing about timeout or memory limits. :/

The PHP configuration can be seen at https://backdropcms.org/admin/reports/status/php (if you have access), but we should really add a .ddev/config file to the core repo so that anyone working on the site can get a local setup identical to the live server.