chdsbd / kodiak

🔮 A bot to automatically update and merge GitHub PRs
https://kodiakhq.com
GNU Affero General Public License v3.0

Kodiak missing from checks #874

Closed: styfle closed this issue 10 months ago

styfle commented 1 year ago

Kodiak is missing from the checks on this PR: https://github.com/vercel/vercel/pull/10133

Even though the automerge label is applied, it seems like it will never merge https://github.com/vercel/vercel/blob/main/.kodiak.toml#L4C1-L4C1

Is kodiak down right now?

[screenshot]
chdsbd commented 1 year ago

I've been looking at this for a while and I think the issue is the number of webhook events in the installation's queue.

Kodiak has a single queue per installation to process all webhooks. Whenever a PR is merged, Kodiak enqueues all open PRs (up to 100 per repo) to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.

There are currently 321 PRs in the queue for evaluation, so Kodiak will take some time to work through them given GitHub API rate limits.

I wonder if we can filter out PRs more here or increase concurrency: https://github.com/chdsbd/kodiak/blob/d6f1720f7ad7babb2f25283d9bf51c3f7e031a44/bot/kodiak/queue.py#L214-L215
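
For reference, here's a rough sketch of that fan-out. This is not the actual Kodiak code (names like `installation_queues` and `PullRequestEvent` are made up), just the shape of one queue per installation with up to 100 open PRs enqueued per repo after a merge:

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequestEvent:
    install_id: str
    repo: str
    number: int


# hypothetical: one evaluation queue per GitHub App installation
installation_queues: dict[str, asyncio.Queue] = {}


async def on_pull_request_merged(
    install_id: str, repo: str, open_pr_numbers: list[int]
) -> None:
    """After a merge, re-enqueue every open PR in the repo (capped at 100,
    the GitHub API page size) so its merge status gets rechecked."""
    queue = installation_queues.setdefault(install_id, asyncio.Queue())
    for number in open_pr_numbers[:100]:
        await queue.put(PullRequestEvent(install_id, repo, number))
```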

chdsbd commented 1 year ago

I think the move going forward is to update Kodiak to have more tasks working on the webhook queue. If we have rate limit issues, we can figure something else out.

styfle commented 1 year ago

> up to 100 per repo to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.

Is it per repo or per org? The repo only has 56 open PRs, which is less than the "100 per repo".

https://github.com/vercel/vercel/pulls?q=is%3Apr+is%3Aopen

styfle commented 1 year ago

I tested it and confirmed that all 56 return in a single API response:

curl -sSL 'https://api.github.com/repos/vercel/vercel/pulls?per_page=100' | jq '. | length'
chdsbd commented 1 year ago

It’s per organization, so it can add up across repos

styfle commented 1 year ago

Since PRs are technically Issues, you could change the API call from /pulls to /issues and filter by the label.

For example, this returns the 1 open PR that has the pr: automerge label:

https://api.github.com/repos/vercel/vercel/issues?per_page=100&labels=pr%3A+automerge

That only works when the repo is configured with an automerge_label, but it could reduce the number in the queue quite a bit.
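
As a quick sanity check in the same spirit as the earlier curl command: the /issues endpoint returns issues and PRs together (PRs carry a pull_request key), so the labeled PRs can be counted with a small one-off script (illustrative only):

```python
import json
import urllib.request

# same query as the URL above: open items with the "pr: automerge" label
url = (
    "https://api.github.com/repos/vercel/vercel/issues"
    "?per_page=100&labels=pr%3A+automerge"
)
with urllib.request.urlopen(url) as resp:
    items = json.load(resp)

# /issues mixes issues and PRs; only items with a "pull_request" key are PRs
automerge_prs = [item for item in items if "pull_request" in item]
print(len(automerge_prs))
```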

chdsbd commented 1 year ago

My plan is to increase the concurrency for webhook ingestion.

chdsbd commented 1 year ago

I've deployed a change to increase the concurrency for your organization to 5 ingest workers. I'm going to monitor the GitHub API rate limits and increase the worker count if the rate limiting isn't an issue
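
As a rough illustration of what "5 ingest workers" means (hypothetical names; the real worker loop lives in bot/kodiak/queue.py), several tasks can drain the same installation queue concurrently:

```python
import asyncio

INGEST_CONCURRENCY = 5  # raised from a single worker for this installation


async def evaluate_pull_request(event) -> None:
    """Placeholder for the real merge-status evaluation."""
    ...


async def ingest_worker(queue: asyncio.Queue) -> None:
    """Repeatedly pull a webhook event off the installation queue and evaluate it."""
    while True:
        event = await queue.get()
        try:
            await evaluate_pull_request(event)
        finally:
            queue.task_done()


def start_ingest_workers(queue: asyncio.Queue) -> list[asyncio.Task]:
    """More workers means more PR evaluations (and GitHub API calls) in flight at once."""
    return [asyncio.create_task(ingest_worker(queue)) for _ in range(INGEST_CONCURRENCY)]
```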

chdsbd commented 1 year ago

I realized yesterday that Kodiak throttles API requests to 5000/hour by default. Since your organization is on GitHub Enterprise, Kodiak has a 15,000/hour rate limit.

I've deployed a change that enables your organization to utilize the 15k rate limit

styfle commented 1 year ago

Wonderful thanks!

I'll close this issue for now and I can re-open if it happens again 👍

styfle commented 1 year ago

Re-opening since the problem is happening again now.

Kodiak shows nothing in the queue, but there is a PR that says it's 3rd in the queue, and it has been for about 12 minutes now.

[screenshots]
chdsbd commented 1 year ago

It might be the rate limit. The logging on rate limits is lacking, so I'll turn down the internal Kodiak limit for now.

chdsbd commented 1 year ago

I added some better logging; it seems like we're hitting a different rate limit.

I'm seeing errors for GraphQL queries (https://api.github.com/graphql) and check-runs (https://api.github.com/repos/vercel/api/check-runs):

{
  "documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
  "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}

https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits

chdsbd commented 1 year ago

I think we can update our throttling logic to wait for the "retry-after" time when we hit a secondary rate limit
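
A minimal sketch of that idea, assuming an httpx-style async client (the wrapper name and plumbing are hypothetical, not Kodiak's actual HTTP layer):

```python
import asyncio

import httpx


async def github_request(
    client: httpx.AsyncClient, method: str, url: str, **kwargs
) -> httpx.Response:
    """On a secondary rate limit response, sleep for the Retry-After interval
    and retry, instead of failing and dropping the pull request event."""
    while True:
        res = await client.request(method, url, **kwargs)
        if res.status_code in (403, 429) and "retry-after" in res.headers:
            await asyncio.sleep(int(res.headers["retry-after"]))
            continue
        return res
```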

chdsbd commented 1 year ago

I'm actually unsure here; looking at the /rate_limit API, we have plenty of room. The only header we're receiving on the secondary rate limit error is retry-after: 60.

I think it may be because we're making many requests concurrently, which the GitHub docs explicitly call out as a no-no 😬 : https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits

> Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.

We could wait 60 seconds after hitting this limit, but that's going to slow things down

{
  "resources": {
    "core": {
      "limit": 15000,
      "used": 3,
      "remaining": 14997,
      "reset": 1688135146
    },
    "search": {
      "limit": 30,
      "used": 0,
      "remaining": 30,
      "reset": 1688131627
    },
    "graphql": {
      "limit": 12500,
      "used": 120,
      "remaining": 12380,
      "reset": 1688135139
    },
    "integration_manifest": {
      "limit": 5000,
      "used": 0,
      "remaining": 5000,
      "reset": 1688135167
    },
    "source_import": {
      "limit": 100,
      "used": 0,
      "remaining": 100,
      "reset": 1688131627
    },
    "code_scanning_upload": {
      "limit": 1000,
      "used": 0,
      "remaining": 1000,
      "reset": 1688135167
    },
    "actions_runner_registration": {
      "limit": 10000,
      "used": 0,
      "remaining": 10000,
      "reset": 1688135167
    },
    "scim": {
      "limit": 15000,
      "used": 0,
      "remaining": 15000,
      "reset": 1688135167
    },
    "dependency_snapshots": {
      "limit": 100,
      "used": 0,
      "remaining": 100,
      "reset": 1688131627
    },
    "code_search": {
      "limit": 10,
      "used": 0,
      "remaining": 10,
      "reset": 1688131627
    }
  },
  "rate": {
    "limit": 15000,
    "used": 3,
    "remaining": 14997,
    "reset": 1688135146
  }
}
styfle commented 1 year ago

Why 60 seconds? The linked docs say 1 second.

> If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.

cdignam-segment commented 1 year ago

Oh, I was just saying that because that’s what the retry-after header said

styfle commented 1 year ago

I suppose if the rate limit is hit, then yes it will have to wait for the duration of Retry-After.

But maybe setting a delay of 1 sec between requests will ensure the rate limit is never hit?
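
A sketch of that suggestion (hypothetical names): serialize the mutating calls per installation and keep at least one second between them, so the secondary limit is ideally never hit in the first place:

```python
import asyncio
import time


class RequestPacer:
    """Serialize mutating GitHub requests and keep them at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self._lock = asyncio.Lock()
        self._min_interval = min_interval
        self._last = 0.0

    async def __aenter__(self) -> "RequestPacer":
        await self._lock.acquire()
        delay = self._min_interval - (time.monotonic() - self._last)
        if delay > 0:
            await asyncio.sleep(delay)
        return self

    async def __aexit__(self, *exc) -> None:
        self._last = time.monotonic()
        self._lock.release()
```

Usage would be one pacer per installation, wrapping each mutating call in `async with pacer: ...`.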

styfle commented 12 months ago

This one has been ready to merge for about 20 min https://github.com/vercel/vercel/pull/10164

[screenshot]
chdsbd commented 12 months ago

Thanks for the ping here. I think I need to synchronize the graphql calls so we only make one per installation at a time. It seems like the REST API isn't returning the secondary rate limit errors.
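
A minimal sketch of that synchronization, assuming one lock per installation (all names here are hypothetical):

```python
import asyncio
from collections import defaultdict

# hypothetical: one lock per installation so only one GraphQL query runs at a time
graphql_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)


async def send_graphql_request(query: str, variables: dict) -> dict:
    """Placeholder for the real POST to https://api.github.com/graphql."""
    return {}


async def run_graphql(install_id: str, query: str, variables: dict) -> dict:
    async with graphql_locks[install_id]:
        return await send_graphql_request(query, variables)
```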

I triggered a refresh of that pull request and it merged (you can do this by editing the pull request title, description, labels, or commits).

Kodiak wasn't actively looking at this PR; I think it hit the secondary rate limit and dropped the event, because we don't retry on those errors.

My TODOs are:

styfle commented 12 months ago

Now I'm seeing a new error message. Perhaps GitHub's API is down?

[screenshot]

Reference: https://github.com/vercel/next.js/pull/52178

chdsbd commented 12 months ago

I think that's because of the secondary rate limit. I'm going to try to temporarily reduce the GraphQL requests by limiting the ingest workers to 1.

styfle commented 11 months ago

I'm seeing that error again today:

problem contacting GitHub API with method 'pull_request/update_branch'

[screenshot]

https://github.com/vercel/next.js/pull/52580

chdsbd commented 11 months ago

Hey @styfle,

Sorry for the delayed reply. Last week, after my message above, I made a change to requeue pull requests on API rate limits, which I think ensures we no longer drop pull requests when we hit those limits.
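
For clarity, a sketch of what "requeue on rate limits" can look like (the exception and function names are hypothetical, not Kodiak's actual internals):

```python
import asyncio


class RateLimited(Exception):
    """Hypothetical error raised when GitHub returns a (secondary) rate limit response."""

    def __init__(self, retry_after: float = 60.0) -> None:
        self.retry_after = retry_after


async def evaluate_pull_request(event) -> None:
    """Placeholder for the real merge-status evaluation."""
    ...


async def process_event(queue: asyncio.Queue, event) -> None:
    """If evaluation gets rate limited, put the PR back on the queue instead of dropping it."""
    try:
        await evaluate_pull_request(event)
    except RateLimited as exc:
        await asyncio.sleep(exc.retry_after)
        await queue.put(event)
```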

I’ll take a look at the PR you linked

chdsbd commented 11 months ago

This error is unrelated to this original issue

https://github.com/vercel/next.js/runs/14992559502

Here’s the issue for this error: https://github.com/chdsbd/kodiak/issues/853

I talked to GitHub a while back and the product team there said this was expected behavior

styfle commented 11 months ago

Is this the same bug as https://github.com/chdsbd/kodiak/issues/853 ?

kodiakhq: status — ⚠️ problem contacting GitHub API with method 'pull_request/merge'

[screenshot]

https://github.com/vercel/next.js/pull/53520

styfle commented 10 months ago

Closing since it seems the remaining issues we've had are the same as https://github.com/chdsbd/kodiak/issues/853