Collecting download count from GitHub broken

indigoxela commented 5 months ago

EDIT: the problem is not what I thought initially. More findings in the comments.

Most plausible reason is some timeout problem.

Previous (wrong assumption): This is the wrong header:

https://github.com/backdrop-ops/backdropcms.org/blob/main/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L53

Actually the API requires now something like:

curl -s -H "Accept: application/vnd.github.v3+json"   https://api.github.com/repos/backdrop-contrib/REPONAME/releases

Note the header: Accept: application/vnd.github.v3+json And that CURLOPT_USERAGENT in that function seems like an odd decision, too. :wink: And I don't get the Authorization header (all repos are public, so all repo info is, too).

argiepiano commented 5 months ago

It'd be great if you provided a PR - I can test... except that perhaps this could only be tested on the beta version of backdropcms.org, so we would need someone who has access to patching beta - @jenlampton or @bugfolder ?

jenlampton commented 5 months ago

It can be tested by anyone who has our GitHub key, which we'd be willing to provide to trusted community members for testing purposes ;) DM me and I'll send it next time I'm at my computer :)

On Sun, Jun 30, 2024, 7:24 AM Alejandro Cremaschi @.***> wrote:

It'd be great if you provided a PR - I can test... except that perhaps this could only be tested on the beta version of backdropcms.org, so we would need someone who has access to patching beta - @jenlampton https://github.com/jenlampton or @bugfolder https://github.com/bugfolder ?

— Reply to this email directly, view it on GitHub https://github.com/backdrop-ops/backdropcms.org/issues/1040#issuecomment-2198579580, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBER7U3M25XQRMQJQ2W4DZKAIJBAVCNFSM6AAAAABKD7CDFKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJYGU3TSNJYGA . You are receiving this because you were mentioned.Message ID: @.***>

argiepiano commented 5 months ago

In fact, to get download counts, there is no need to use a GitHub key. I believe this is now publicly available with the get request @indigoxela posted above.

However, how would we test the PR on the backdropcms.org site? I'm a bit unclear about this.

indigoxela commented 5 months ago

An odd finding here: some of the numbers displayed on B-org seem correct (rules, backup_migrate, imce...). Others for projects that exist for a while are completely wrong (leaflet).

That suggests, that this odd and dated function borg_project_metrics_get_downloads() might eventually still work, and the problem's somewhere else... I can't actually test without the authorization token, but @jenlampton you could. Just to make sure, we're not fixing something, that's not broken. :wink:

Stats ARE broken. But eventually not because of borg_project_metrics_get_downloads(). FTR: stats seem broken for all projects created since 2022 (or so).

Maybe there's something wrong with these newer nodes on B-org, or maybe there are just too many projects to still handle them that way (without batches, just in loops, it seems).

indigoxela commented 5 months ago

Some more comparison of actual numbers and numbers displayed on B-org...

It feels sort of random. Some numbers are correct. The majority is not. The newer a project is the wronger the numbers. Projects created in the last years don't display any download counts at all.

My conclusion: it's the number of projects to loop over. Some timeout or memory exhaustion might strike. Or eventually GitHub sets limits re API calls. Hard to know "from outside".

argiepiano commented 5 months ago

My conclusion: it's the number of projects to loop over. Some timeout or memory exhaustion might strike

That sounds plausible! Perhaps if we used a queue and Backdrop's Queue API for this, instead of trying to load all and loop at once, and risk a timeout.

Github says there is a maximum rate of 5000 requests per hour. It's hard to know how many nodes b-org currently has for projects, themes and layouts. I bet they are more than 5000. Perhaps a cleanup of old nodes is needed, in addition to starting to use a queue.

All of this is really hard to test without full access to the b-org site. We need a person with full access to be on board.

docwilmot commented 5 months ago

@bugfolder is the guy we need for this.

I think also, borg_project_metrics_cron() could use some watchdog messages for success, not just error. And I suspect the first try/catch should be testing borg_project_metrics_get_downloads() not $node->save().

bugfolder commented 5 months ago

Available to test and deploy.

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org, and (b) there are some things you can't handshake with GH from a local dev site (TMK).

argiepiano commented 5 months ago

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org, and (b) there are some things you can't handshake with GH from a local dev site (TMK).

Also, a local instance will not have the possibly thousands of nodes pointing at projects.

bugfolder commented 5 months ago

Also, a local instance will not have the possibly thousands of nodes pointing at projects.

Sure they would. That would be the primary point of building a local version of backdropcms.org (as opposed to a new vanilla installation of B), to have all the nodes, etc. You use the sanitized dbs for backdrop and CiviCRM; the former contains the nodes.

jenlampton commented 5 months ago

That sounds plausible! Perhaps if we used a queue and Backdrop's Queue API for this, instead of trying to load all and loop at once, and risk a timeout.

This is a good idea anyway, but I don't think we are hitting the timeout (yet).

There are 1.2k projects in backdrop-contrib, so we can't have more than that many project nodes on b.org (contrib projects include those without releases).

I think anyone can build a local instance of b.org for testing, but (a) it's a lot trickier now that CiviCRM is part of b.org

Maybe we should update the README with some informaiton about how to set up a local site? I'll create a separate issue for that :)

there are some things you can't handshake with GH from a local dev site (TMK).

We should document these (maybe also in the README), as you may be right but I'm not sure exactly what they are. AFAIK the getting information FROM GitHub should work as expected, but pushging things TO GitHub (like the zipped project) is only possible from b.org. We should do some testing to confirm.

argiepiano commented 5 months ago

There are 1.2k projects in backdrop-contrib, so we can't have more than that many project nodes on b.org (contrib projects include those without releases).

Yes, but perhaps there are more than one node per project? Perhaps one per release?

jenlampton commented 5 months ago

Yes, but perhaps there are more than one node per project? Perhaps one per release?

Those are project release nodes (a different type). I don't know if they get usage data separately? They might!

edit: we have 2974 project release nodes, so still not more than the limit.

indigoxela commented 5 months ago

edit: we have 2974 project release nodes, so still not more than the limit.

That brings us back to the initial question: why is this broken?

It's not working. That's a fact. And it seems broken for quite a while now. From outside we can't figure out, what broke. It needs someone to check the logs, eventually debug a bit more, to figure out the culprit. With only the code and possibly even if taking the effort to set up a demo based on this repo, we probably won't figure out.

What I can see is, that this count fetch happens once a day for all projects in one loop. And that seems questionable to me - given the raising number of projects.

That there are "only" 2974 project release nodes does not mean, only 2974 requests will be made to the GH API. There are several fetch jobs and even one job can cause several requests to GH API, if there are many releases (pages). Note also that it seems that, additionally to download counts, there are jobs to fetch commits from core and pulls from CiviCRM (possibly more?).

Here the fetch happens (via cron, all at once): https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L119

This is the actual fetch: https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L43

argiepiano commented 5 months ago

I feel the culprit is a timeout - but as @indigoxela, it's hard to tell unless someone can install a local version using the current database and config.

bugfolder commented 5 months ago

I've started putting together instructions for setting up a local instance with the current (sanitized) db and config. Should have something in the next few days, will ping here when it's in place.

argiepiano commented 5 months ago

I'm thinking that testing that this is a timeout or memory issue will be tricky locally, as this may be dependent on server settings. I know so little about Apache and servers - do they typically create some sort of error log when a process times out or runs out of memory? If so, has anyone checked the logs for b-org?

jenlampton commented 5 months ago

Timeout and memory issues are in PHP, which has good logging. There are a bunch of PHP errors in the logs on the live site (which should be addressed), but nothing about timeout or memory limits. :/

The PHP configurations can be seen at https://backdropcms.org/admin/reports/status/php (if you have access) but we should really add a .ddev/config file into the core repo so that anyone working on the site can get a local setup identical to the live server.

bugfolder commented 5 months ago

For those wanting to set up a local instance, I've created a PR on https://github.com/backdrop-ops/backdropcms.org/issues/1041 with instructions that you can try out. If you find anything unclear or non-working, post a comment on the issue and I'll update the PR. DM me on Zulip if you need credentials for the sanitized db site.

bugfolder commented 5 months ago

Meanwhile, I installed our GH token credentials on my local, enabled Devel module, and ran this code to get downloads for a single module:

$temp = borg_project_metrics_get_downloads('webform');
dpm($temp,'$temp');

Result was "0". Which seems a bit low.

Stepping into the debugger, after the curl call and separating out the header and body, I get this result for the body:

{"message":"Not Found","documentation_url":"https://docs.github.com/rest/repos/repos#get-a-repository","status":"404"}

The value of $nextUrl was

https://api.github.com/repos/webform/releases

I'm just starting to read up, so if this triggers any feedback or suggestions for changes, please post.

argiepiano commented 5 months ago

The value of $nextUrl was https://api.github.com/repos/webform/releases

This URL is wrong. The correct one should be https://api.github.com/repos/backdrop-contrib/webform/releases

argiepiano commented 5 months ago

But I think the problem is that you are calling the function borg_project_metrics_get_downloads with the wrong parameter. I believe borg_project_metrics_get_project_nodes() would actually return backdrop-contrib/webform, so you are typing the name of the project without the path.

argiepiano commented 5 months ago

Oops closed by accident

bugfolder commented 5 months ago

Ah.

$temp = borg_project_metrics_get_downloads('backdrop-contrib/webform');
dpm($temp,'$temp');

returns 10245, which is the correct number. So that, at least, works.

argiepiano commented 5 months ago

Can you check backdrop-contrib/views_bulk_operations? The download count shown on the b-org page is rather low, lower than the current active installations. When I manually add the download_count returned by the API, I get 810, but b-org says 161

indigoxela commented 5 months ago

Again, as it seems, this has been missed: the download count seems correct for some projects, but is wrong for most. The newer a project is, the wronger the numbers get. Projects created in the past two years, don't get download stats at all.

The project nodes for the loop get collected without sorting in the db query, so I'd assume, the ordering is as-is from the database (nid). Which means, older projects get handled earlier in the loop, than newer projects. At some point things stop working completely, that's why newer projects end up without any stats, it seems. That there's no PHP nagging, doesn't mean, nothing's wrong. :wink:

bugfolder commented 5 months ago

Can you check backdrop-contrib/views_bulk_operations?
$temp = borg_project_metrics_get_downloads('backdrop-contrib/views_bulk_operations');
dpm($temp,'$temp');
This returns 818.

argiepiano commented 5 months ago

This returns 818.

Yes! So this means that:

The API works as expected
This is most likely a problem with the loop - most likely a timeout or memory problem. As @indigoxela mentioned, the newest projects go last.

So, if the problem is timeout, perhaps we can try handling this with batch calls. Right now this is all handled in one page request. Or if the problem is memory, then by using BackdropQueue and breaking it down into several cron calls.

bugfolder commented 5 months ago

The first loop in borg_project_metrics_cron() gets all project nodes (borg_project_metrics_get_project_nodes() returns 962 entries), loops over them to get the downloads, and then in the try/catch loop, saves the value in $node->field_download_count for the node if it's non-zero.

The loop runs every time $date['G'] == 22, i.e., once a day. Wouldn't that mean that every project node should have a modification date of 1 day ago or more recent? Because only the first 326 do in Admin > Content (filtered to type Project Module).

And Views Bulk Operations shows its last modification date of June 27, 2024.

Perhaps it would be useful in borg_project_metrics_get_downloads(), at the end when it's computing the total, if the body returned from curl didn't contain the download count (which might be because it's returning some timeout error or something like that), we post what the body actually was to watchdog and then go look for messages of that sort the next day?

indigoxela commented 5 months ago

@argiepiano It might have to do with timeout, but there's another possibility to consider: rate limit.

We considered requrests per hour, but it might be, there's an additional limit per shorter time span.

When playing with fetching (without authorization token, though), I realized the x-ratelimit-... headers. Wonder, what the values are for authenticated requests.

Is it possible to drop in some debug/dpm here for the the values? https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L64

@bugfolder I'm particularly interested in values like x-ratelimit-remaining and x-ratelimit-limit for authenticated calls (which I can't test).

It's a bit unfortunate, that the records don't store their "last fetched" timestamp. That way it would be easier to split fetches into multiple cron runs. I hope, we can solve this via batch or something else, because otherwise it's getting quite a big change in logic...

bugfolder commented 5 months ago

@indigoxela, dpm($myHeader) gives

bugfolder commented 5 months ago

The x-ratelimit-reset timestamp is one hour from when I sent the request.

bugfolder commented 5 months ago

It's a bit unfortunate, that the records don't store their "last fetched" timestamp. That way it would be easier to split fetches into multiple cron runs. I hope, we can solve this via batch or something else, because otherwise it's getting quite a big change in logic...

If we used state_get/state_set to store and retrieve the last timestamp for each project within that loop it wouldn't be a huge change in logic, would it? Of course we'd have 962 new entries in the state table, but storing "last timestamp" is explicitly one of the things the state table is intended for. If we ran that loop a couple times a day, but didn't check individual projects until they're at least 24 hours old, that should have the effect of spreading out the calls across multiple loops, right?

bugfolder commented 5 months ago

Of course we'd have 962 new entries in the state table...

And I see that even a single call to state_get() by anyone reads in the entire state table, so this would impose a permanent cost to any page loads if they're checking state for any reason. (I don't have a good sense of whether that's too high a price to pay.)

argiepiano commented 5 months ago

I don't think it's necessary to use the state table to store last fetched. This can be done in a field for that content type. Still, I don't think this is necessary. We could just split the calls using backdrop's queue into batches of, say 100 nodes per process, and run these every hour upon cron calls, until all of them have been processed. The original queue would be set up at 22 hours, and we could potentially run this process 23 times after that, each time with 100 nodes.

@bugfolder how often is cron set up to run in b-org?

bugfolder commented 5 months ago

how often is cron set up to run in b-org?

crontab -l shows a call scheduled for every 5 minutes.

# Run regular cron every 5 minutes.
*/5 * * * * wget -O - -q -t 1 https://backdropcms.org/core/cron.php?cron_key=<snip>

bugfolder commented 5 months ago

Sorry, briefly visible previous watchdog image (now deleted) was of my dev site (d'oh!). b.org is showing completion every 5 minutes with no issues.

And I see that there's a throttle in the code of no more than once per hour for borg_project_metrics_cron(),

indigoxela commented 5 months ago

@bugfolder many thanks for your digging. So, rate-limits aren't the culprit at all. Good to know. :+1:

Yes, the throttle in borg_project_metrics_cron() prevents excessive runs. That's a good thing.

@argiepiano I'm not sure if we could rely solely on batches without info, when the last fetch happened. Maybe I'm missing something, but it feels sort of flaky...

IF we store the last_fetched timestamp somewhere... it seems to me, the states table isn't the right spot. We'd need the relation node->nid => last_fetch. It could be a timestamp field on the node. But is a (visible) field on the node the right thing, anyway? Not sure. On the other hand, that info would be available via node_load. But a custom db table seems more efficient to me (two columns, nid and timestamp, I'd assume).

The logic in borg_project_metrics_get_project_nodes() would need an overhaul to account for last_fetched and pull (let's say) a range of 200 at max, those not fetched for the longest time first. And borg_project_metrics_cron() would have to write the updated timestamp along with node->save. Other thoughts?

indigoxela commented 5 months ago

The loop runs every time $date['G'] == 22, i.e., once a day. Wouldn't that mean that every project node should have a modification date of 1 day ago or more recent? Because only the first 326 do

That's an interesting finding. Yes, obviously only 326 nodes got their stats updated, which means, 636 projects have incorrect or missing download stats. :stuck_out_tongue_winking_eye: That confirms my finding from manual checking of a handful of random projects.

indigoxela commented 5 months ago

Wait a minute... crazy idea: node->save will update the node's changed timestamp. Why don't we simply use that timestamp as "last fetched" value?

No new field, no new table, no states... just node changed. :bulb:

Then we'd only have to update one function - borg_project_metrics_get_project_nodes() to also sort and filter by "changed" (and db_query_range with, lets say, 200).

indigoxela commented 5 months ago

That might work:

$types = array('project_module', 'project_theme', 'project_layout', 'core');
$needs_update = time() - 3600 * 24;
$result = db_query_range("SELECT nid FROM {node} WHERE type IN (:types) AND status = 1 AND changed < :needs_update ORDER BY changed ASC", 0, 200, array(
  ':types' => $types,
  ':needs_update' => $needs_update,
))->fetchAll();

That would replace this part: https://github.com/backdrop-ops/backdropcms.org/blob/62994720640f1da760fb23dc1380e776a69a525a/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L17-L25

A small change... :wink:

Wait again, no, we'd also have to remove (or adapt) the if (date('G') == 22) {} wrapped around the code in function borg_project_metrics_cron(). Still not such a huge change.

Downside: if the node got updated by other means (manual edit, new release), the condition isn't met (for one day).

indigoxela commented 4 months ago

So silent here? Possibly you're all waiting for a PR?

Well, here is one: https://github.com/backdrop-ops/backdropcms.org/pull/1043

I wasn't able to actually test - besides running similar db queries via Devel.

argiepiano commented 4 months ago

Hi! Sorry, I'm on the road and unavailable for 2 weeks. A quick look at this - if the problem was generated by a timeout or memory problem due to an overload of nodes to be checked, I'm not sure this PR is addressing that problem. Did I miss something in the discussion about the cause for this? I see that this is running on every cron run except at 23 hours, and I also see that only nodes with an updated time of more than 24 hours are loaded, but this doesn't assure that the code will load and process a manageable number of nodes per cron run. I believe it's better to go the route of splitting the number of loaded nodes into an equal, safe, hard-coded number of nodes (50 nodes per cron?) and process those. In order to do that, a Backdrop queue will be the way to go, where the queue stores nids (not full nodes) and the queue gets fully loaded once a day, then on subsequent crons, 50 items from the queue are processed at a time.

[EDIT]: I'm aware this is still an imperfect solution, as this may only work if the total number of nodes is less 14,350 (50 nodes every 5 minutes for 23 hours and 55 minutes), but this number probably won't be reached in a while.

I may be missing something in this message, I can take a closer look when I'm back home in a week or so. Still, I'd have a hard time testing, since we would need the github token to fully recreate a local version for testing.

indigoxela commented 4 months ago

Hi! Sorry, I'm on the road and unavailable for 2 weeks.

That's OK, enjoy, or I wish you success - whatever fits best. :wink:

I also see that only nodes with an updated time of more than 24 hours are loaded...

Yes, but only 200 at max - that's a subtle difference. :wink: That's why my PR switches to db_query_range() - it picks the 200 oldest project nodes that haven't been updated in the past 24 hrs.

We can discuss this (now hardcoded) 200 - it's based on the finding by @bugfolder, that 326 nodes seem to get updated daily (if I got that right). So 200 should work without problems.

With our current number of projects, we could go down to - lets say - 50 per cron run. But we'd possibly have to adapt that number when lots of new projects get added. However, with my approach this only means, that the update per project wouldn't be daily anymore, but with a little longer delay. We can decide.

But first of all testing is required. I couldn't properly test this.

argiepiano commented 4 months ago

Ah, thanks, I had missed the db_query_range part.

On vacation right now, so "enjoy" is right! :)

bugfolder commented 4 months ago

I started to test this PR (and will do more testing later today), but a quick notice of something: in the logic, if borg_project_metrics_get_downloads($m[0]) returns 0 for any reason, the node will not get updated, and so a search on its changed date will keep turning it up. So if there's a bunch of old nodes with 0 downloads (or if for some reason GH is returning 0—I need to look into this further), this loop will keep looking at those nodes, to the exclusion of others.

    foreach ($project_nodes as $m) {
      $num = borg_project_metrics_get_downloads($m[0]);
      if ($num) {
        try {
          $node = node_load($m[1]);
          $node->field_download_count['und'][0]['value'] = $num;
          $node->save();
        }
        catch(Exception $e) {
          $message = $e->getMessage();
          watchdog('borg_project_metrics', $message, array(), WATCHDOG_ERROR);
        }
      }
    }

jenlampton commented 4 months ago

I still think it's still a big improvement over what we have now. Maybe it's okay that we check for the first download of a project more often?

bugfolder commented 4 months ago

if borg_project_metrics_get_downloads($m[0]) returns 0 for any reason, the node will not get updated...

And here's some possible reasons. I installed the patch, then put a debugger breakpoint on the totaling line in borg_product_metrics_get_downloads(), in this bit:

    $json = json_decode($body, TRUE);
    if (!empty($json)) {
      foreach ($json as $j) {
        $total += isset($j['assets'][0]['download_count']) ? $j['assets'][0]['download_count'] : 0; // BREAK HERE
      }
    }

And then took a look around each time it stopped for the first 10 or so projects.

First project was backdrop-contrib/paywall_braintree. The return value will be zero, because $body is this array:

{"message":"Not Found","documentation_url":"https://docs.github.com/rest/releases/releases#list-releases","status":"404"}

So this means that this project will show up every single time, since its project node is never modified. And the loop itself is inefficient: $body should contain a single array, and so if it contains 3 strings, we know something's amiss; we don't need to check every string for whether it's an array containing an assets key.

Next project: backdrop-contrib/paywall_stripe - same thing.

Next project: backdrop-contrib/insert-view - similar failure, but "Moved Permanently," rather than "Not found".

Next project: backdrop-contrib/token - this returns a proper array that contains the 'assets' key, however, $j['assets'] is an empty array; it doesn't include a download key. That's because the Token project is a stub since Token module is now in core.

Net project: backdrop-contrib/pure_css - Moved Permanently

...A few more of "Moved Permanently" or "Not found".

Next project: backdrop-contrib/jquery_rss - returns an array, but $j['assets'] is empty. I observe that jQuery.RSS Block project provides no download files, so that's probably related.

And finally, we get to backdrop-contrib/google_cse. This works, i.e., the right number of downloads is fetched and the number is recorded in the node and it's updated to the new value (was 51, now is 61).

After letting cron finish running once, I see 48 project nodes that got updated, from which I infer that there's 152 nodes like the above clogging the system, since they get called every time because their changed date is never updated.

Running cron again bumps the number of updated projects to 82.

I infer that the patch basically works, in that repeated cron runs would eventually get all the projects updated, but we should modify the code somehow so that projects that return 0 for downloads aren't repeatedly updated. Perhaps just update their changed date anyhow without actually changing anything.

bugfolder commented 4 months ago

Quoth @jenlampton

I still think it's still a big improvement over what we have now...

(That appeared while I was typing up the results of my analysis above.) I agree,the patch is a big improvement. Seems, though, that as long as we're fixing this, we could fix it in such a way as to eliminate the clogging projects from each run. If, sometime in the future, the number of clogging projects got to be over 200, we'd be back in this situation again.

jenlampton commented 4 months ago

I thought we weren't actually hitting a rate limit?

If not, why don't we just batch and do all the projects every day, rather than only 200? I think it's reasonable to use the updated time for the sort (so we start with those that were updated longest ago) but we should still go through all of them every day and the 200 count won't be an issue.

backdrop-ops / backdropcms.org

Collecting download count from GitHub broken #1040