Closed jtreanor closed 5 years ago
I have contacted Github about this and I have also found some discussion of a similar issue here: https://github.community/t5/GitHub-API-Development-and/Random-401-errors-after-using-freshly-generated-installation/m-p/22905.
It looks like the solution may be to reduce the number of API calls made directly after a new token is generated, and to implement retries on 401 errors.
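The retry half of that suggestion could look something like this. This is only a sketch, not Peril's actual code: the names (`fetchWithRetry`, `RequestFn`) and the backoff numbers are assumptions.

```typescript
// Hypothetical retry helper: re-issues a request when it comes back 401,
// with a growing delay, up to `maxRetries` extra attempts. Freshly minted
// installation tokens can briefly 401, so a short wait often fixes it.
type RequestFn = () => Promise<{ status: number }>

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

export async function fetchWithRetry(
  request: RequestFn,
  maxRetries = 3,
  backoffMs = 500
): Promise<{ status: number }> {
  let response = await request()
  for (let attempt = 1; attempt <= maxRetries && response.status === 401; attempt++) {
    // Back off a little longer on each retry before asking again.
    await sleep(backoffMs * attempt)
    response = await request()
  }
  return response
}
```

In Peril this would wrap the `node-fetch` call; here it only assumes something that resolves to an object with a `status` field, so it is easy to test against a fake.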
Ouch, yeah - I have it creating a new token all over the place.
I was gonna recommend what you did with your @octokit/app branch. Now that I think about it, I saw this issue a bunch of times in the Artsy peril, just not enough to warrant diving in further.
Sigh. Maybe we should merge that cached call and add an explicit 2s delay after generating a token in getTemporaryAccessTokenForInstallation?
Yeah 😞 If we cache the token like I have done on that branch and add a 2 second delay each time a new one is created, that should be enough to work around it.
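Put together, the workaround being discussed could be sketched like this. The names here are invented for illustration; `mintToken` stands in for whatever actually calls the GitHub API (e.g. Peril's getTemporaryAccessTokenForInstallation), and the TTL assumes GitHub App installation tokens last about an hour.

```typescript
// Sketch: cache the installation token until shortly before it expires,
// and pause ~2s after minting a new one so it has time to become valid.
const TOKEN_TTL_MS = 55 * 60 * 1000 // renew a few minutes before the ~1h expiry

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

interface CachedToken {
  token: string
  mintedAt: number
}

let cached: CachedToken | null = null

export async function getToken(
  mintToken: () => Promise<string>,
  delayMs = 2000
): Promise<string> {
  const now = Date.now()
  if (cached && now - cached.mintedAt < TOKEN_TTL_MS) {
    return cached.token
  }
  const token = await mintToken()
  cached = { token, mintedAt: now }
  // Give the freshly generated token time to propagate before using it.
  await sleep(delayMs)
  return token
}
```

With the cache in place, the 2s pause only costs anything when a token is actually re-minted, which should be roughly once an hour rather than once per run.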
I have actually worked around this issue on our instance by caching the Dangerfiles as well as the token, so we very rarely need to request them. I doubt that would be acceptable in the general case so a delay is probably the way to go.
Well.... maybe it could? Peril will get notifications when Dangerfiles are changed: the settings JSON lists all of them, and Peril will receive push / pull-merged notifications. I do this for the settings JSON here
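A Dangerfile cache invalidated by those push notifications could be as simple as the sketch below. Everything here (`DangerfileCache`, `handlePushEvent`, the key format) is invented for illustration and is not Peril's actual internals.

```typescript
// Sketch: in-memory cache of Dangerfile contents, keyed by repo + path,
// with entries dropped whenever a push webhook reports the file changed.
export class DangerfileCache {
  private files = new Map<string, string>()

  private key(repo: string, path: string): string {
    return `${repo}@${path}`
  }

  get(repo: string, path: string): string | undefined {
    return this.files.get(this.key(repo, path))
  }

  set(repo: string, path: string, contents: string): void {
    this.files.set(this.key(repo, path), contents)
  }

  // Called from a GitHub `push` webhook handler: forget cached copies of
  // any file touched by the push so the next run re-fetches it.
  handlePushEvent(repo: string, changedPaths: string[]): void {
    for (const path of changedPaths) {
      this.files.delete(this.key(repo, path))
    }
  }
}
```

The trade-off mentioned later in the thread still applies: if the instance doesn't receive webhooks for the settings repo, stale entries survive until a restart.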
Oh! Interesting. I didn't realise Peril did this.
That assumes that Peril is installed on the settings repo, which in our case it isn't. We share one Peril settings repo (https://github.com/Automattic/peril-settings) with Peril instances for the Automattic, wordpress-mobile and woocommerce orgs. Since PERIL_ORG_INSTALLATION_ID is hardcoded, an instance/Github app is needed for each.
However, perhaps adding multi-org support would be wiser than engineering for this unusual use case 😄
Yeah, tricky - I run Artsy's as a multi-org, but to do that you start getting into real complex infra (a mongo db, AWS lambdas, etc).
I feel like I made it pretty binary: either "easy = simple" like now via ENVs, or full-on "you need to really grok the system" architecture (for per-org runtime safety). But it might not be too hard to make multi-org work on just one system if the mongo setup could be decoupled.
Yeah 😄 I'll take a look at the multi-org stuff and see if it's feasible without too much effort.
Caching Dangerfiles still could be fine for us though. I can live with restarting the instances when they are updated if we don't get the webhooks.
Either way I'll look at implementing some kind of workaround for this issue next week.
As expected, this has been resolved by https://github.com/danger/peril/pull/446 and https://github.com/danger/peril/pull/443. However, it is still possible for the 401 errors to happen within Danger-js. I have opened https://github.com/danger/danger-js/pull/898 to add similar retry logic to Danger itself.
Here is a log from our app running master showing this error in Danger:
Request failed [401]: https://api.github.com/repos/wordpress-mobile/gutenberg-mobile/pulls/1240
Response: {
"message": "Bad credentials",
"documentation_url": "https://developer.github.com/v3"
}
Since we started running Peril several months back we have been seeing intermittent 401 failures in getGitHubFileContents. Re-running always solves the problem, but it has been increasingly disruptive recently and I would love to get to the bottom of this. Here are the logs for a recent failure (I have added a few extra lines). As expected, this leads the Peril run to fail:
I have tried everything I can think of to resolve this but I've had no luck so far. Here is what I know:

- The failures always occur in the same place (getGitHubFileContents).
- getGitHubFileContents is called from runPRRun or validateRuns in pr.ts. It is called 3 times for each file on every run.
- I have tried switching to @octokit/app, which caches tokens so they aren't re-created on each run. This is still failing, although interestingly the failures I have seen with this implementation are immediately after a new token is generated. You can see my branch with these attempted fixes here.
- Our settings repo (Automattic/peril-settings) is in a different Github org to the repo Peril is running on. However, they were originally both in the wordpress-mobile org and we still had these problems.

The only ideas I have left seem unlikely. Perhaps this is a Github bug or some kind of misleading rate limiting. Alternatively, it could be a problem with node-fetch or Peril's api function in github_helpers.ts. I enabled LOG_FETCH_REQUESTS and everything looked as expected though.

Any ideas? I am at a loss.