indigoxela opened this issue 5 months ago
If not, why don't we just batch and do all the projects every day, rather than only 200?
My understanding from reading @indigoxela's PR is that it will do them all every day in hourly batches of 200, always starting with those that were updated longest ago. But it looks like the first 150 or so in each batch of 200 are the same oldsters that never get updated. For example, two successive runs of the cron job return an array with backdrop-contrib/paywall_braintree as the first entry in the array of 200. For it, and ones like it, we shouldn't check it again until the next day.
So I think that @indigoxela's approach in the PR would accomplish what you requested; it just needs to update the nodes that returned 0 (for whatever reason) so they won't get checked again for 24 hours.
Maybe we also need another issue to remove those project nodes (like https://backdropcms.org/project/paywall_braintree) that correspond to nonexistent GH projects. Perhaps have the metrics cron job post a watchdog report for nonexistent projects as it encounters them to make them visible to administrators.
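To make that concrete, the selection boils down to something along these lines (a rough sketch, not the PR's actual query; the node type names are assumptions):
// Rough sketch: pick the 200 project nodes whose changed timestamp is oldest,
// skipping anything touched within the last day, so a node that was just
// processed (even if it got 0 downloads) isn't re-checked until tomorrow.
$one_day_ago = REQUEST_TIME - 86400;
$nids = db_query_range("SELECT nid FROM {node} WHERE type IN (:types) AND status = 1 AND changed < :one_day_ago ORDER BY changed ASC", 0, 200, array(
  ':types' => array('project_module', 'project_theme'),
  ':one_day_ago' => $one_day_ago,
))->fetchCol();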
But it looks like the first 150 or so in each batch of 200 are the same oldsters who never get updated.
It sounds like the batch information might not be updating properly if we're catching the same items again before we've gone through the whole COUNT once. Here's an example of what the batch update could look like:
function hook_update_N(&$sandbox) {
  if (!isset($sandbox['progress'])) {
    // Initialize batch update information.
    $sandbox['progress'] = 0;
    $sandbox['last_processed'] = 0;
    state_set('current_update_started', time());
    // Get the total number of items to update. (The table and column names
    // here are placeholders for wherever the last-updated timestamp lives.)
    $substitutions = array(':time' => state_get('current_update_started'));
    $count_query = "SELECT COUNT(nid) FROM {node} WHERE updated < :time";
    $sandbox['max'] = db_query($count_query, $substitutions)->fetchField();
  }
  // Retrieve the next batch of items that need to be updated.
  $query = "SELECT nid, updated FROM {node} WHERE updated < :time AND nid > :nid ORDER BY nid";
  $limit = 200;
  $substitutions = array(':nid' => $sandbox['last_processed'], ':time' => state_get('current_update_started'));
  $result = db_query_range($query, 0, $limit, $substitutions);
  foreach ($result as $record) {
    $node = node_load($record->nid);
    // ...update the project count on $node here...
    $node->save();
    // Update progress information for the batch update.
    $sandbox['progress']++;
    $sandbox['last_processed'] = $node->nid;
    $sandbox['last_timestamp'] = $record->updated;
  }
  // Are we finished?
  $sandbox['#finished'] = empty($sandbox['max']) ? 1 : $sandbox['progress'] / $sandbox['max'];
  if ($sandbox['#finished'] >= 1) {
    state_set('update_completed_time', time());
    return t("Project data updated.");
  }
}
Note: this code is completely untested and intended only to indicate intent :)
@bugfolder that's an interesting finding. So we have zombie nodes: nodes that show up on B-org, but whose repo moved or got deleted? We could catch that by verifying that curl comes back with a successful HTTP 200.
The logic problem with zero downloads needs consideration, too. 0 is a valid download count. :thinking:
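Something like this is what I have in mind (just a sketch; the repo URL, user agent string and watchdog wording are only examples, not the module's actual code):
// Sketch: check the HTTP status code that curl saw before trusting the body.
$url = 'https://api.github.com/repos/backdrop-contrib/example/releases';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, 'backdropcms.org metrics');
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($code != 200) {
  // 404: repo gone, 301: repo moved, 0: couldn't connect at all.
  watchdog('borg_project_metrics', 'Fetching @url returned HTTP @code.', array('@url' => $url, '@code' => $code), WATCHDOG_WARNING);
}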
@jenlampton it looks like you're coming up with a totally different approach; did I get that right? Does this make my attempt (PR) obsolete?
Updated my PR to consider some of the concerns:
- Consider the HTTP return code when getting the download count.
- Skip the loose check for $num, as borg_project_metrics_get_downloads() always returns a number, and 0 is a valid one.
However, I've no idea how to handle the "zombie project nodes". That's not something the metrics module can do. But now they should get updated, too, so they don't further "stick in the update queue", and watchdog warns about them.
Again: test cautiously, as I couldn't.
My code example was based off an update hook batch, so not directly applicable. I didn't see any batching in the current PR so I expect that happens elsewhere in the module? I haven't looked at it yet, just thinking "out loud" about how to solve the problems mentioned, since it sounded like they were batch-related :)
We should probably solve the zombie nodes problem in another issue. Agreed, they aren't really part of the project count solution.
It may be total nonsense, but something odd came to mind... Cron runs every 5 minutes on B-org? Doesn't that mean we have to get our stuff done within 5 minutes, before the next job starts? I'm no expert with Backdrop cron, though. If that's the case, maybe the reported problem with broken stats started when cron was switched to 5 minutes?
Do we know how long it takes currently? I have some long-running cron tasks on other sites and it's not really an issue if another cron is skipped, but if it skips 5 or more that can start to be an issue.
Currently watchdog is reporting "Cron run completed" on the 5's (e.g., 10:20, 10:25, ...), which suggests that it's taking less than a minute.
I observed in backdrop_cron_run() that cron acquires a 240-second lock and reports "Attempting to re-run cron while it is already running" to watchdog if another cron run is still running. Not seeing that in the dblog.
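(For reference, the relevant bit of backdrop_cron_run() is roughly the following, paraphrased from memory rather than copied verbatim.)
// Bail out if a previous cron run still holds the lock.
if (!lock_acquire('cron', 240.0)) {
  watchdog('cron', 'Attempting to re-run cron while it is already running.', array(), WATCHDOG_WARNING);
  return FALSE;
}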
@bugfolder more interesting findings... :wink:
Can you double check that you're inspecting the right hour? debug(date('G, e')); should also display the timezone identifier. It might, for example, be that the server timezone is UTC.
That the "Attempting to re-run..." message doesn't show up at all in the logs is good news, though. But we still have no evidence of why the actual job (fetching download data) doesn't work as expected. Or, more precisely, why it only works for part of the items.
I'm curious, what testing the latest changes in my PR does, hoping for some improvement. :wink:
I'm curious, what testing the latest changes in my PR does, hoping for some improvement.
Code certainly looks good, and I'll test it as soon as I can shake loose some time (and the right kind of time). Hoping soon.
I've done some testing locally (running cron manually, then resetting borg_project_metrics_cron in the state table so I could run it again). Download collection happened with each run, and no repetition from zombie nodes.
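(For anyone repeating the test: assuming borg_project_metrics_cron is the state key, a reset can be as simple as the following.)
// Remove the stored value so the module treats the next cron call as a fresh run.
state_del('borg_project_metrics_cron');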
Also, as intended, problems getting downloads are now reported in watchdog, identifying the problematic project. After 3 cron runs, there were 19 watchdog reports. Some examples:
Attempt to fetch download count from https://api.github.com/repos/backdrop-contrib/basic_cart_braintree/releases resulted in HTTP 404. — This is appropriate; there is no basic_cart_braintree project in backdrop-contrib.
Attempt to fetch download count from https://api.github.com/repos/backdrop-contrib/stripe_api/releases resulted in HTTP 301. — This is appropriate, because stripe_api has changed to stripe.
Attempt to fetch download count from https://api.github.com/repos/backdrop-contrib/token/releases resulted in HTTP 0. — the https://github.com/backdrop-contrib/token project reports "This repository has been archived by the owner...is now read-only."
Attempt to fetch download count from https://api.github.com/repos/backdrop-contrib/riddler/releases resulted in HTTP 0. — This one is a bit more puzzling, because https://github.com/backdrop-contrib/riddler exists and has releases.
And so forth and so on.
At this point, it appears that we've fixed the immediate problem, and with 200 checks 23 times per day (4,600 checks per day), we can handle a very large number of future projects. So I'd advocate that we merge this into the site repo and then check back after a day.
And we'll have the exceptions/zombies represented in watchdog, so the separate issue will have some grist to work with.
That HTTP 0 for riddler is indeed a riddle :wink: (I'm the maintainer of that module, and in my own stats module the download count fetch works just fine).
So it might be that HTTP 0 has to get handled with care - it probably only means the server couldn't be reached at all. AFAIK this 0 stands for "host not found". But actually this is rather a topic for the "Zombies issue".
So maybe if HTTP return code was 0, one should not update the downloads number to zero, in hopes that the next attempt will get through to the server. Perhaps detect that situation and return a total of -1, and then in the update loop, if the total is -1, don't update the downloads number for the node, but still save the node to update its changed date?
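Roughly what I'm picturing, as a sketch only (the field name and loop variables here are made up, not the module's actual ones):
// In the fetch function: HTTP 0 means the server was never reached, so
// signal "unknown" instead of returning a count of 0.
if ($http_code == 0) {
  return -1;
}

// In the update loop: keep the existing count when the fetch was
// inconclusive, but still save the node so its changed date moves forward.
$num = borg_project_metrics_get_downloads($repo);
if ($num != -1) {
  $node->field_download_count['und'][0]['value'] = $num;
}
$node->save();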
That sounds great.
On Thu, Jul 11, 2024, 9:34 AM Robert J. Lang @.***> wrote:
So maybe if HTTP return code was 0, one should not update the downloads number to zero, in hopes that the next attempt will get through to the server. Perhaps detect that situation and return a total of -1, and then in the update loop, if the total is -1, don't update the downloads number for the node, but still save the node to update its changed date?
— Reply to this email directly, view it on GitHub https://github.com/backdrop-ops/backdropcms.org/issues/1040#issuecomment-2223390640, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBER2OH5GQLTHQ75L27HTZL2XXRAVCNFSM6AAAAABKD7CDFKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRTGM4TANRUGA . You are receiving this because you were mentioned.Message ID: @.***>
So maybe if HTTP return code was 0, one should not update the downloads number to zero, in hopes that the next attempt will get through to the server.
That's a good point. We wouldn't want the count to get reset to 0 in that case... :thinking:
PR updated (yet another time: test cautiously).
In case of network problems that aren't in any way related to the repo, it now skips over that node, to pick it up again in the next run. I don't think it's a good idea to still update the unchanged node in that case, as then the next fetch attempt for this repo would be made a day later instead of an hour later. The problem still gets reported in dblog, though. (Should it?)
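In pseudo-form the skip is just this (not the literal PR code):
// -1 signals a network problem: log it and move on without saving, so the
// node's changed date stays old and it gets retried in the next hourly run.
if ($num == -1) {
  watchdog('borg_project_metrics', 'Network problem while fetching data for @repo.', array('@repo' => $repo), WATCHDOG_WARNING);
  continue;
}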
The problem still gets reported in dblog, though. (Should it?)
Yes, I think so. If it only shows up once, it can be ignored, but if it's happening every hour, something needs looking into.
Sorry for the long delay; I was traveling. Now back. I've tested out the latest PR on my local site. Although it generated a slew of "Network problem while fetching..." reports in watchdog, it successfully updated 80-odd modules, and a second cron call updated another 80 or so.
Most importantly, there was no smoke or flames. So I am in favor of going ahead and merging this PR and deploying it. We can see if the "network problem" afflicts the real server, or was peculiar to my local installation, and meanwhile, lots of modules that had never been getting updated will be brought up to date.
Let's do it! thank you everyone for helping get this resolved.
Great teamwork here!!! 🙌🏼 🎉 ❤️
I am currently on holiday, so trying to keep up with the chat and issue queues as best as I can while the girls are sleeping at night...
Can we please add some inline comments for the sake of people reviewing the code in the future, so that certain things don't seem "magic"? I went through this thread, and here's a few of the comments that I think should be captured as inline comments:
... my PR switches to db_query_range() - it picks the 200 oldest project nodes that haven't been updated in the past 24 hrs.
... if HTTP return code was 0, one should not update the downloads number to zero, in hopes that the next attempt will get through to the server. Perhaps detect that situation and return a total of -1, and then in the update loop, if the total is -1, don't update the downloads number for the node, but still save the node to update its changed date?
That's a good point. We wouldn't want the count to get reset to 0 in that case
In case of network problems, that aren't in any way related to the repo, it now skips over it. To pick that node up again in the next run. I don't think it's a good idea to still update the unchanged node in that case, as then the next fetch attempt for this repo would be made a day later instead of an hour later.
The bits in the code that seem appropriate for these comments are the following:
$max_per_cron = 200;
$result = db_query_range("SELECT nid FROM {node} WHERE type IN (:types) AND status = 1 AND changed < :one_day_ago ORDER BY changed ASC", 0, $max_per_cron, array(
), WATCHDOG_WARNING);
$total = -1;
break;
...and/or here:
// The value "-1" indicates some sort of network problem.
if ($num != -1) {
Also, this was not apparent to me straight away:
// Prevent overlap with weekly jobs.
if (date('G') != 23) {
...I was left wondering "Why? How? ...and what weekly jobs?".
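For example, the wording could go in this direction (just a sketch of possible comments, not checked against the actual PR):
// Process at most 200 project nodes per hourly cron run. Combined with the
// one-day threshold in the query below, this spreads all projects across the
// day and keeps any single node from being re-checked more than once a day.
$max_per_cron = 200;

// The 23:00 run is skipped so this hourly job doesn't collide with the
// weekly jobs triggered in that hour.
if (date('G') != 23) {
  // ... hourly download-count work ...
}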
Sorry, I can't provide a PR at the moment - most likely after a couple of weeks(?) ...unless someone beats me to it.
EDIT: the problem is not what I thought initially. More findings in the comments.
The most plausible reason is some timeout problem.
Previous (wrong assumption): This is the wrong header:
https://github.com/backdrop-ops/backdropcms.org/blob/main/www/modules/custom/borg_project_metrics/borg_project_metrics.module#L53
Actually the API now requires a request something like the sketch at the end of this comment. Note the header:
Accept: application/vnd.github.v3+json
And that CURLOPT_USERAGENT in that function seems like an odd decision, too. :wink: And I don't get why there's an Authorization header (all repos are public, so all repo info is, too).
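For comparison, a minimal request with that header might look like the following (a sketch; the repo URL and user agent string are just examples):
// Minimal GitHub API request with the recommended Accept header. Note that
// GitHub also rejects requests that send no user agent at all.
$ch = curl_init('https://api.github.com/repos/backdrop-contrib/riddler/releases');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/vnd.github.v3+json'));
curl_setopt($ch, CURLOPT_USERAGENT, 'backdropcms.org metrics');
$releases = json_decode(curl_exec($ch));
curl_close($ch);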