danger / peril

☢️ Serious and immediate danger.
https://danger.systems

getGitHubFileContents fails intermittently with 401 "Bad Credentials" #440

Closed · jtreanor closed this issue 5 years ago

jtreanor commented 5 years ago

Since we started running Peril several months back we have been seeing intermittent 401 failures in getGitHubFileContents. Re-running always solves the problem but it has been increasingly disruptive recently and I would love to get to the bottom of this.

Here are the logs for a recent failure (I have added a few extra lines). As expected, this leads the Peril run to fail:

app/web.1:  info: ## pull_request.synchronize on unknown on wordpress-mobile/WordPress-iOS 
app/web.1:  info:    5 runs needed: Automattic/peril-settings@org/common.ts, Automattic/peril-settings@org/ios-macos.ts, Automattic/peril-settings@org/github/label.ts, Automattic/peril-settings@org/github/milestone.ts and Automattic/peril-settings@org/github/diff-size.ts 
app/web.1:  error: Getting GitHub file failed: {"message":"Bad credentials","documentation_url":"https://developer.github.com/v3"} 
app/web.1:  error: url: https://api.github.com/repos/Automattic/peril-settings/contents/org/ios-macos.ts? 
app/web.1:  error: status: 401 
app/web.1:  error: headers: {} 
app/web.1:  error: body: {"message":"Bad credentials","documentation_url":"https://developer.github.com/v3"} 

I have tried everything I can think of to resolve this but I've had no luck so far. Here is what I know:

The only ideas I have left seem unlikely. Perhaps this is a GitHub bug or some kind of misleading rate limiting. Alternatively, it could be a problem with node-fetch or Peril's api function in github_helpers.ts. I enabled LOG_FETCH_REQUESTS and everything looked as expected though.

Any ideas? I am at a loss.

jtreanor commented 5 years ago

I have contacted GitHub about this and I have also found some discussion of a similar issue here: https://github.community/t5/GitHub-API-Development-and/Random-401-errors-after-using-freshly-generated-installation/m-p/22905.

It looks like the solution may be to reduce the number of API calls directly after a new token is generated and to implement retries on 401 errors.
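For illustration, the retry part could be as simple as wrapping the fetch call; fetchWithAuthRetry, the retry count and the delay below are made-up names and values for this sketch, not Peril's actual helper:

import fetch, { RequestInit, Response } from "node-fetch"

// Hypothetical retry wrapper: retry a GitHub API call a few times when a
// freshly generated installation token is rejected with a 401, pausing
// between attempts to give GitHub time to start accepting the token.
const fetchWithAuthRetry = async (
  url: string,
  options: RequestInit,
  retries: number = 3,
  delayMs: number = 2000
): Promise<Response> => {
  const response = await fetch(url, options)
  if (response.status === 401 && retries > 0) {
    await new Promise(resolve => setTimeout(resolve, delayMs))
    return fetchWithAuthRetry(url, options, retries - 1, delayMs)
  }
  return response
}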

orta commented 5 years ago

Ouch, yeah - I have it creating a new token all over the place.

I was gonna recommend what you did with your @octokit/app branch. Now that I think about it, I saw this issue a bunch of times in the Artsy peril, just not enough to warrant diving in further.

Sigh. Maybe we should merge that cached call and just add an explicit 2s delay after generating a token in getTemporaryAccessTokenForInstallation?

jtreanor commented 5 years ago

> Sigh. Maybe there should just be an explicit 2s delay after generating a token in getTemporaryAccessTokenForInstallation?

Yeah 😞 If we cache the token like I have done on that branch and add a 2 second delay each time a new one is created, that should be enough to work around it.

I have actually worked around this issue on our instance by caching the Dangerfiles as well as the token, so we very rarely need to request them. I doubt that would be acceptable in the general case so a delay is probably the way to go.
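A rough sketch of that combined workaround, assuming an in-memory cache keyed by installation ID and a stand-in createInstallationToken call (the real function and its signature in Peril may differ):

// Stand-in for the real GitHub App token request (an assumption, not Peril's API)
declare function createInstallationToken(
  installationID: number
): Promise<{ token: string; expiresAt: number }>

// In-memory cache of installation tokens, keyed by installation ID
const tokenCache = new Map<number, { token: string; expiresAt: number }>()

const getTemporaryAccessTokenForInstallation = async (installationID: number): Promise<string> => {
  const cached = tokenCache.get(installationID)
  if (cached && cached.expiresAt > Date.now()) {
    return cached.token
  }

  const { token, expiresAt } = await createInstallationToken(installationID)
  tokenCache.set(installationID, { token, expiresAt })

  // Explicit 2s delay so the freshly minted token is not immediately rejected with a 401
  await new Promise(resolve => setTimeout(resolve, 2000))
  return token
}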

orta commented 5 years ago

Well... maybe it could? Peril will get notifications when Dangerfiles are changed: the settings JSON lists all of them, and Peril receives push / pull request merged notifications. I do this for the settings JSON here
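Roughly, that amounts to invalidating a Dangerfile cache whenever a push webhook touches one of the cached paths; the cache shape and handler name below are made up for illustration:

// Hypothetical cache of Dangerfile contents, keyed by "repo@path"
const dangerfileCache = new Map<string, string>()

interface PushEvent {
  repository: { full_name: string }
  commits: Array<{ added: string[]; modified: string[]; removed: string[] }>
}

// Drop cached Dangerfiles that a push webhook reports as changed, so the
// next run fetches them fresh instead of using a stale copy.
const invalidateChangedDangerfiles = (event: PushEvent) => {
  const repo = event.repository.full_name
  const changed = event.commits.flatMap(c => [...c.added, ...c.modified, ...c.removed])
  for (const path of changed) {
    dangerfileCache.delete(`${repo}@${path}`)
  }
}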

jtreanor commented 5 years ago

Oh! Interesting. I didn't realise Peril did this.

That assumes that Peril is installed on the settings repo, which in our case it isn't. We share one Peril settings repo (https://github.com/Automattic/peril-settings) with Peril instances for the Automattic, wordpress-mobile and woocommerce orgs. Since PERIL_ORG_INSTALLATION_ID is hardcoded, a separate instance/GitHub app is needed for each.

However, perhaps adding multi-org support would be wiser than engineering for this unusual use case 😄

orta commented 5 years ago

Yeah, tricky - I run Artsy's as a multi-org, but to do that you start getting into really complex infra (a Mongo DB, AWS Lambdas, etc.)

I feel like I made it pretty binary: either "easy = simple" like now via ENVs, or full-on "you need to really grok the system" architecture (for per-org runtime safety). But it might not be too hard to make multi-org work on just one system if the Mongo setup could be decoupled.

jtreanor commented 5 years ago

Yeah 😄 I'll take a look at the multi-org stuff and see if it's feasible without too much effort.

Caching Dangerfiles could still be fine for us though. I can live with restarting the instances when they are updated if we don't get the webhooks.

Either way I'll look at implementing some kind of workaround for this issue next week.

jtreanor commented 5 years ago

As expected, this has been resolved by https://github.com/danger/peril/pull/446 and https://github.com/danger/peril/pull/443. However, it is still possible for the 401 errors to happen within Danger-js. I have opened https://github.com/danger/danger-js/pull/898 to add similar retry logic to Danger itself.

Here is a log from our app running master showing this error in Danger:

Request failed [401]: https://api.github.com/repos/wordpress-mobile/gutenberg-mobile/pulls/1240 
Response: { 
  "message": "Bad credentials", 
  "documentation_url": "https://developer.github.com/v3" 
}