datalad / datalad-usage-dashboard

Dashboard of detected usages of DataLad
MIT License

Program individually queries every GitHub repository, leading to problems #41

Closed jwodder closed 7 months ago

jwodder commented 7 months ago

When looking for DataLad datasets on GitHub, most of the information about the datasets found is returned from search requests, which list multiple repositories per page. However, for various reasons, the program still ends up making a separate GitHub API request for each & every dataset, both those found by searching and those previously recorded but not found in the current run.

As a result, in a run of find_datalad_repos where no new datasets are found, we end up making (in addition to the search requests) a separate request for each & every dataset already recorded in datalad-repos.json, which causes the program to exceed GitHub's 5000/hour request limit.

At this point, I can't tell exactly what is happening, as the log level in CI runs is only INFO rather than DEBUG, and some of the historical CI run times are all over the place. However, one important factor is that, since PR #31, the program uses my ghreq library for querying GitHub, which (by default) will not spend more than 5 minutes total sleeping on a request — and, since the release on 2023-12-17, if waiting for the rate limit to reset would take more than five minutes, ghreq doesn't bother waiting at all and just immediately reraises the 403 error. Unfortunately, as per PR #11, 403 responses when getting a repository's details are treated the same as 404s, so it's possible that a number of repositories are erroneously being marked "gone" when they shouldn't be.
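For what it's worth, GitHub's REST API marks rate-limit 403s (and 429s) with an `X-RateLimit-Remaining: 0` or `Retry-After` header, so rate-limit responses can in principle be told apart from "repository is gone/blocked" responses. A rough sketch of that distinction, using plain `requests` rather than ghreq's actual retry configuration (the function name is just for illustration):

```python
import requests

def classify_repo_response(resp: requests.Response) -> str:
    """Roughly classify a response to GET /repos/{owner}/{repo}."""
    if resp.status_code == 200:
        return "exists"
    if resp.status_code == 404:
        return "gone"
    if resp.status_code in (403, 429):
        # Primary rate limiting: X-RateLimit-Remaining is 0; secondary
        # rate limiting: a Retry-After header is present.  Other 403s
        # (e.g. DMCA-blocked repositories) carry neither signal.
        if resp.headers.get("X-RateLimit-Remaining") == "0":
            return "rate-limited"
        if "Retry-After" in resp.headers:
            return "rate-limited"
        return "gone"
    resp.raise_for_status()
    return "unknown"
```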

One possible solution to this would be to use the GraphQL API to query repository details, which allows making requests for multiple repositories at once. However, the way the GraphQL rate limit works is a bit inscrutable, so this might not work either, and the GraphQL API uses different IDs for repositories than the ones that are recorded in datalad-repos.json.
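For illustration, batching repository lookups with GraphQL aliases might look roughly like this (a sketch using plain `requests` against the public schema; the helper names and field selection are placeholders, not existing code in this repository):

```python
import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def build_batch_query(repos: list[tuple[str, str]]) -> str:
    # One query document that fetches several repositories at once by
    # aliasing repeated `repository(...)` fields (r0, r1, ...).
    parts = [
        f'r{i}: repository(owner: "{owner}", name: "{name}") '
        "{ nameWithOwner stargazerCount isArchived }"
        for i, (owner, name) in enumerate(repos)
    ]
    return "query {\n" + "\n".join(parts) + "\n}"

def fetch_batch(token: str, repos: list[tuple[str, str]]) -> dict:
    r = requests.post(
        GRAPHQL_URL,
        json={"query": build_batch_query(repos)},
        headers={"Authorization": f"bearer {token}"},
    )
    r.raise_for_status()
    return r.json()["data"]  # e.g. {"r0": {...}, "r1": None, ...}
```

A repository that no longer exists should come back as a null alias (with a NOT_FOUND entry under `errors`) rather than failing the whole request, though the separate GraphQL rate limit and node-ID mismatch mentioned above would still need to be sorted out.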

CC @yarikoptic

yarikoptic commented 7 months ago

Re hitting the limits: I think there is no huge need to check every repo every day, either to see whether it is still there or to update its stars. What if we add a last-checked datetime and then check only up to 1000 least-recently-checked repos on each run? We could also skip checking altogether if a repo was last checked, e.g., less than a week ago. Ideally it would not be a fixed 1000 but "until we hit the limit", so that we test a good bunch, save the results, and then check only the next ones (or none, if everything has been checked within the week) on the next run. WDYT?
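Roughly, the selection could look like this (a sketch only; the `last_checked` field is hypothetical and nothing in datalad-repos.json stores it yet):

```python
from datetime import datetime, timedelta, timezone

def select_repos_to_check(repos: list[dict], limit: int = 1000) -> list[dict]:
    # Pick at most `limit` entries, least recently checked first, skipping
    # anything already checked within the past week.  Assumes timezone-aware
    # ISO 8601 timestamps, e.g. "2024-05-01T12:00:00+00:00"; entries without
    # a `last_checked` value (never checked) sort to the front.
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=1)
    stale = [
        r for r in repos
        if r.get("last_checked") is None
        or datetime.fromisoformat(r["last_checked"]) < cutoff
    ]
    stale.sort(key=lambda r: r.get("last_checked") or "")
    return stale[:limit]
```

The fixed 1000 could then be replaced by trimming the list at run time based on the remaining rate-limit budget, along the lines of "until we hit the limit" above.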

jwodder commented 7 months ago

@yarikoptic

yarikoptic commented 7 months ago

> when a repo is returned in a search result, we get its stars

Do you mean that it is a separate call for stars? If so -- yes, I think we can apply the same "delayed checking".

> Note that https://github.com/datalad/datalad-usage-dashboard/pull/42 modifies the code so that both checks will now update both stars and the "last updated" timestamp.

great, makes sense.

> Do you want to continue treating (non-rate-limit-related) 403 responses when querying repository details the same as 404s?

From your description it sounds like we would want to drop its treatment there as "Gone" and instead treat it as "limit exceeded", right?