codecov / engineering-team

This is a general repo to use with GH Projects
1 stars 1 forks source link

Investigate gracefully failing if we encount a 502 from GitHub during sync_repos #1757

Open drazisil-codecov opened 3 months ago

drazisil-codecov commented 3 months ago

It appears in some cases, GitHub returns a 502 during the pagination calls to fetch repos (not on the first one, not a rate limit)

Th current behavior appears to be to cancel the entire DB transaction. It would be better if we could log the error and commit what when have so far, providing the customer with at least a partial list of repos, instead of an empty one.

drazisil-codecov commented 2 months ago

https://github.com/codecov/engineering-team/issues/1549 feels related

heather-codecov commented 2 months ago

One customer had found that complete logout and logging back into GitHub might have helped. (And added clarity that before it was not an issue of authorizing my account to their org - that had been authorized).

eliatcodecov commented 2 months ago

Changes in order of importance:

  1. change commit logic so we don't throw out entire repo list when we hit a 502.
  2. Test a smaller page size on GitHub API call and determine impact on sync times. If acceptable, consider rollout.
matt-codecov commented 2 months ago

the 502s are a problem on their own which will hopefully be solved with https://github.com/codecov/shared/pull/232 and https://github.com/codecov/shared/pull/233

as for gracefully failing in sync_repos: we already commit after each repo insert https://github.com/codecov/worker/blob/1beef847cdd951cd084adf4632893c39f4c442e9/tasks/sync_repos.py#L436 so we don't throw repo data away on error. what we can improve is how we update owner.permission, the list of private repos that a user can see. https://github.com/codecov/worker/pull/478 does that

previously, SoftTimeLimitExceeded and TorngitClientError would cause that list to be totally erased, and TorngitServerFailureError (5xx errors) was uncaught so the owner.permission list would be left alone. now, for all three errors, we log the error, note that the list of private repos in owner.permission may be incomplete, and then let the task update it and finish normally.

for SoftTimeLimitExceeded and TorngitClientError, this is better behavior. you can at least see some repos, rather than none. for TorngitServerFailureError, it's kind of a mixed bag. in the old behavior, new repos would not appear but old ones would not disappear (even if they should). in the new behavior, new repos may appear, but old ones may disappear (even if they shouldn't). i think the new behavior is still better on balance

matt-codecov commented 1 month ago

status update: two of the three PRs shipped, and it seems the problem was solved for the customer who reported tons of 5XXs blocking their usage of the service

the list_repos page size experiment was just updated to 100%, so 50% control and 50% test. no regressions observed so far, so will probably roll out. also no benefits observed, but 5XXs aside from the one customer are very rare so we can't effectively measure benefits anyway

leaving this issue open until the experiment is cleaned up

matt-codecov commented 1 week ago

aiaiai. finished rolling out the new page size of 50, and then discovered a user for whom 50 is not low enough but 20 is. so, going to set the experiment back up with 50 as the new control and 20 as the new test group. stay tuned...