datalad / datalad-usage-dashboard

Dashboard of detected usages of DataLad
MIT License

Program individually queries every GitHub repository, leading to problems #41

Closed jwodder closed 7 months ago

jwodder commented 7 months ago

When looking for DataLad datasets on GitHub, most of the information about the datasets found is returned from search requests, which list multiple repositories per page. However, for various reasons, the program still ends up making a separate GitHub API request for each & every dataset, both those found by searching and those previously recorded but not found in the current run.

As a result, in a run of find_datalad_repos where no new datasets are found, we end up making (in addition to the search requests) a separate request for each & every dataset already recorded in datalad-repos.json, which causes the program to exceed GitHub's 5000/hour request limit.

At this point, I can't tell exactly what is happening, as the log level in CI runs is only INFO rather than DEBUG, and some of the historical CI run times are all over the place. However, one important factor is that, since PR #31, the program uses my ghreq library for querying GitHub, which (by default) will not spend more than 5 minutes total sleeping on a request — and, since the release on 2023-12-17, if waiting for the rate limit to reset would take more than five minutes, ghreq doesn't bother waiting at all and just immediately reraises the 403 error. Unfortunately, as per PR #11, 403 responses when getting a repository's details are treated the same as 404s, so it's possible that a number of repositories are erroneously being marked "gone" when they shouldn't be.
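For what it's worth, GitHub's REST API marks rate-limit 403s (and 429s) with an `X-RateLimit-Remaining: 0` or `Retry-After` header, so rate-limit responses can in principle be told apart from "repository is gone/blocked" responses. A rough sketch of that distinction, using plain `requests` rather than ghreq's actual retry configuration (the function name is just for illustration):

```python
import requests

def classify_repo_response(resp: requests.Response) -> str:
    """Roughly classify a response to GET /repos/{owner}/{repo}."""
    if resp.status_code == 200:
        return "exists"
    if resp.status_code == 404:
        return "gone"
    if resp.status_code in (403, 429):
        # Primary rate limiting: X-RateLimit-Remaining is 0; secondary
        # rate limiting: a Retry-After header is present.  Other 403s
        # (e.g. DMCA-blocked repositories) carry neither signal.
        if resp.headers.get("X-RateLimit-Remaining") == "0":
            return "rate-limited"
        if "Retry-After" in resp.headers:
            return "rate-limited"
        return "gone"
    resp.raise_for_status()
    return "unknown"
```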

One possible solution to this would be to use the GraphQL API to query repository details, which allows making requests for multiple repositories at once. However, the way the GraphQL rate limit works is a bit inscrutable, so this might not work either, and the GraphQL API uses different IDs for repositories than the ones that are recorded in datalad-repos.json.
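For illustration, batching repository lookups with GraphQL aliases might look roughly like this (a sketch using plain `requests` against the public schema; the helper names and field selection are placeholders, not existing code in this repository):

```python
import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def build_batch_query(repos: list[tuple[str, str]]) -> str:
    # One query document that fetches several repositories at once by
    # aliasing repeated `repository(...)` fields (r0, r1, ...).
    parts = [
        f'r{i}: repository(owner: "{owner}", name: "{name}") '
        "{ nameWithOwner stargazerCount isArchived }"
        for i, (owner, name) in enumerate(repos)
    ]
    return "query {\n" + "\n".join(parts) + "\n}"

def fetch_batch(token: str, repos: list[tuple[str, str]]) -> dict:
    r = requests.post(
        GRAPHQL_URL,
        json={"query": build_batch_query(repos)},
        headers={"Authorization": f"bearer {token}"},
    )
    r.raise_for_status()
    return r.json()["data"]  # e.g. {"r0": {...}, "r1": None, ...}
```

A repository that no longer exists should come back as a null alias (with a NOT_FOUND entry under `errors`) rather than failing the whole request, though the separate GraphQL rate limit and node-ID mismatch mentioned above would still need to be sorted out.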

CC @yarikoptic

yarikoptic commented 7 months ago

Re hitting the limits: I think there is no huge need to check every repo every day, either to see whether it is still there or to update its stars. What if we add a last-checked datetime and then check only up to 1000 least-recently-checked repos on each run? We could also skip checking altogether if a repo was last checked, e.g., less than a week ago. Ideally it would not be a fixed 1000 but "until we hit the limit", so that we test a good bunch, save the results, and then check only the next ones (or none, if everything has been checked within the week) on the next run. WDYT?
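Roughly, the selection could look like this (a sketch only; the `last_checked` field is hypothetical and nothing in datalad-repos.json stores it yet):

```python
from datetime import datetime, timedelta, timezone

def select_repos_to_check(repos: list[dict], limit: int = 1000) -> list[dict]:
    # Pick at most `limit` entries, least recently checked first, skipping
    # anything already checked within the past week.  Assumes timezone-aware
    # ISO 8601 timestamps, e.g. "2024-05-01T12:00:00+00:00"; entries without
    # a `last_checked` value (never checked) sort to the front.
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=1)
    stale = [
        r for r in repos
        if r.get("last_checked") is None
        or datetime.fromisoformat(r["last_checked"]) < cutoff
    ]
    stale.sort(key=lambda r: r.get("last_checked") or "")
    return stale[:limit]
```

The fixed 1000 could then be replaced by trimming the list at run time based on the remaining rate-limit budget, along the lines of "until we hit the limit" above.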

jwodder commented 7 months ago

@yarikoptic

yarikoptic commented 7 months ago

> when a repo is returned in a search result, we get its stars

Do you mean that it is a separate call for stars? If so -- yes, I think we can apply the same "delayed checking".

> Note that https://github.com/datalad/datalad-usage-dashboard/pull/42 modifies the code so that both checks will now update both stars and the "last updated" timestamp.

great, makes sense.

> Do you want to continue treating (non-rate-limit-related) 403 responses when querying repository details the same as 404s?

From your description it sounds like we would want to drop its treatment there as "Gone" and instead treat it as "limit exceeded", right?