evaherrada opened this issue 2 years ago
We are now often seeing the job exceed the 6-hour job runtime limit. For example: https://github.com/adafruit/circuitpython-org/actions/runs/6809744073/job/18516615845.
There is also some GitHub API rate limiting going on (visible in that job's log), but it would be really nice to make this run faster.
Thinking about your local test in #361, are we doing the queries without credentials? If we did them with credentials, would we avoid the rate limit?
I'm not sure about authorization. I filed #348 thinking that this process was using Classic PATs (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token-classic) which are deprecated or discouraged; I forget the exact context that led to me filing that issue but it seems to be related to this note in the docs:
> Note: Organization owners can restrict the access of personal access token (classic) to their organization. If you try to use a personal access token (classic) to access resources in an organization that has disabled personal access token (classic) access, your request will fail with a 403 response. Instead, you must use a GitHub App, OAuth app, or fine-grained personal access token.
Using PATs of some kind seems to be the Proper Way (TM) to do this, but "fine-grained" PATs are the preferred way now, if we don't want to deal with having a GitHub App: https://docs.github.com/en/actions/security-guides/automatic-token-authentication#granting-additional-permissions
I am pretty sure SOME kind of authentication/token is at play here; otherwise we would be stuck at the unauthenticated limit of 60 API calls an hour, far too few.
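For reference, GitHub's `/rate_limit` endpoint reports which limit applies and doesn't itself count against it. A quick check (a sketch, reusing the same secret name that shows up in the job log below):

```python
import os
import requests

# The /rate_limit endpoint does not count against the REST rate limit.
token = os.environ.get("ADABOT_GITHUB_ACCESS_TOKEN")
headers = {"Authorization": f"token {token}"} if token else {}

core = requests.get(
    "https://api.github.com/rate_limit", headers=headers
).json()["resources"]["core"]
# Unauthenticated: limit == 60/hour. With a valid PAT: 5000/hour.
print(f"limit={core['limit']} remaining={core['remaining']}")
```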
My local test was without authorization, but that was actually convenient, since it let me easily hit the timeout case. For the run in circuitpython-org that's troubling us, we appear to be using a token via ADABOT_GITHUB_USER and ADABOT_GITHUB_ACCESS_TOKEN:
```
Thu, 09 Nov 2023 09:20:13 GMT ADABOT_GITHUB_USER: ***
Thu, 09 Nov 2023 09:20:13 GMT ADABOT_GITHUB_ACCESS_TOKEN: ***
```
Since the values of repository secrets can't be inspected, I don't know which user & token are in play.
I think these tokens are OK, because I changed them recently, carefully, and other Actions jobs depend on them.
A successful run looks like this:
```
Sat, 25 Nov 2023 09:19:00 GMT Run Date: 25 November 2023, 09:19AM
Sat, 25 Nov 2023 09:19:00 GMT - Report output will be saved to: /home/runner/work/circuitpython-org/circuitpython-org/bin/adabot/libraries.v2.json
Sat, 25 Nov 2023 10:02:45 GMT GitHub API Rate Limit reached. Pausing until Rate Limit reset.
Sat, 25 Nov 2023 10:02:45 GMT Rate Limit will reset at: 2023-11-25 10:19:13
Sat, 25 Nov 2023 10:19:13 GMT GitHub API Rate Limit reached. Pausing until Rate Limit reset.
Sat, 25 Nov 2023 10:20:59 GMT {
Sat, 25 Nov 2023 10:20:59 GMT "updated_at": "2023-11-25T09:19:00Z",
[rest of json snipped]
```
So a successful run can hit the rate limit, pause, and recover. A failed run, by contrast:
```
Fri, 24 Nov 2023 09:20:27 GMT - Report output will be saved to: /home/runner/work/circuitpython-org/circuitpython-org/bin/adabot/libraries.v2.json
Fri, 24 Nov 2023 10:06:30 GMT GitHub API Rate Limit reached. Pausing until Rate Limit reset.
Fri, 24 Nov 2023 10:20:39 GMT Rate Limit will reset at: 2023-11-24 10:20:39
Fri, 24 Nov 2023 10:20:39 GMT GitHub API Rate Limit reached. Pausing until Rate Limit reset.
Fri, 24 Nov 2023 10:20:39 GMT Rate Limit will reset at: 2023-11-24 10:20:39
Fri, 24 Nov 2023 15:20:09 GMT Error: The operation was canceled.
```
Here, the advertised reset time is identical to the timestamp of the log line itself, and after that the job produces no further output until it is cancelled five hours later. Besides adding debugging, another idea is to prepend the command with `timeout`, so that we can hopefully get a Python traceback from wherever the process is stuck: something like `timeout -s INT 18000 [adabot command]`. This would be a change in the circuitpython-org repo, not here, I think.
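If we want the traceback without depending on the signal being delivered cleanly, Python's `faulthandler` module can do the same thing from inside the process. A sketch of what that could look like in adabot (hypothetical, not current code):

```python
import faulthandler
import sys

# Hypothetical: if the run is still going after 5 hours (18000 s), dump
# every thread's traceback to stderr so the log shows where we are stuck.
# exit=False keeps the process running after the dump.
faulthandler.dump_traceback_later(18000, exit=False, file=sys.stderr)

# ... the normal adabot validation work would run here ...

# Cancel the watchdog once the run finishes in time.
faulthandler.cancel_dump_traceback_later()
```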
Also, we could sleep until 60 seconds after we think the rate limit will reset, not 1 second after; since we hit the rate limit after 40-45 minutes and it resets after an hour, this wouldn't really lower our throughput much.
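A sketch of that margin, assuming the reset time comes from GitHub's `X-RateLimit-Reset` response header (a Unix timestamp):

```python
import time

RESET_MARGIN = 60  # seconds past the advertised reset, instead of 1


def wait_for_rate_limit_reset(response):
    """Sleep until safely past the reset time advertised by GitHub."""
    reset_at = int(response.headers["X-RateLimit-Reset"])
    delay = reset_at - time.time() + RESET_MARGIN
    if delay > 0:
        print(f"Rate limit reached; sleeping {delay:.0f}s until past reset.")
        time.sleep(delay)
```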
Can we insert delays between our requests? Would that throttle it enough to avoid the rate limit?
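For scale: the authenticated limit is 5000 requests/hour, so spacing requests evenly means at most one every 0.72 s. A minimal throttle wrapper (a sketch, not adabot's current code):

```python
import time

MIN_INTERVAL = 3600 / 5000  # 0.72 s keeps us under 5000 requests/hour
_last_request = 0.0


def throttled_get(session, url, **kwargs):
    """Issue a GET, sleeping first so we never exceed the hourly budget."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return session.get(url, **kwargs)
```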
I think I found the real problem :tada:
Fixed by #362.
I saw another failed 6-hour run, despite circuitpython-org being updated to include #362. :confused: https://github.com/adafruit/circuitpython-org/actions/runs/7030445965 Reopening @jepler
oh drat
I have thought about moving part of the CI into the libraries themselves somehow. The basic idea is to have each library repo's CI, when the repo is updated (or on a schedule), push a record of the information gathered either to a central repo (like a folder in this repository, for example) or possibly to an S3 bucket. This has the advantage of spreading out or eliminating API calls, as well as distributing the burden over multiple CI runs across the libraries. The CI here would then just collect all the records.
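A sketch of what the collection side could look like, assuming (hypothetically) each library pushed a small JSON record into a `reports/` folder in a central repo:

```python
import json
from pathlib import Path


def collect_reports(reports_dir="reports"):
    """Merge per-library JSON records into one report, with no API calls."""
    merged = {}
    for record_file in sorted(Path(reports_dir).glob("*.json")):
        with record_file.open() as f:
            merged[record_file.stem] = json.load(f)
    return merged

# e.g. write the combined report where the Actions job expects it:
# json.dump(collect_reports(), open("libraries.v2.json", "w"), indent=2)
```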
Results of my initial tests of the functions that get run. Total time per repo: ~3.5s.

```
validate_actions_state:    0.459673s
validate_contents:         0.433544s
validate_core_driver_page: 0.003178s
validate_default_branch:   0.000001s
validate_in_pypi:          0.040591s
validate_labels:           0.117371s
validate_passes_linting:   0.639910s
validate_readthedocs:      1.361857s
validate_release_state:    0.333909s
validate_repo_state:       0.134701s
```
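For anyone wanting to reproduce numbers like these, a sketch of a timing harness (hypothetical; not necessarily how the numbers above were gathered):

```python
import time


def time_validators(validators, repo):
    """Run each validator against one repo, printing wall-clock time each."""
    total = 0.0
    for validate in validators:
        start = time.monotonic()
        validate(repo)
        elapsed = time.monotonic() - start
        total += elapsed
        print(f"{validate.__name__}: {elapsed:.6f}s")
    print(f"Total time per repo: {total:.1f}s")
```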
So the real big one is validate_readthedocs. That one is definitely one of the more complicated ones, but I think we could for sure simplify it.
validate_passes_linting also takes a while, but it actually clones the repository, so that makes sense.
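If validate_readthedocs spends its time on repeated HTTP lookups (an assumption; I haven't dug into it), memoizing those lookups across repos would be one easy win. A minimal sketch, with an illustrative endpoint:

```python
import functools

import requests


@functools.lru_cache(maxsize=None)
def fetch_rtd_project(slug):
    """Fetch Read the Docs project info once per slug and reuse it.

    The endpoint here is illustrative; whichever RTD call the validator
    actually makes, caching it works the same way.
    """
    resp = requests.get(
        f"https://readthedocs.org/api/v3/projects/{slug}/", timeout=30
    )
    resp.raise_for_status()
    return resp.json()
```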