Shopify / shipit-engine

Deployment coordination
https://shopify.engineering/introducing-shipit
MIT License
1.41k stars 144 forks source link

Retry background jobs #1353

Closed kartiki975 closed 1 month ago

kartiki975 commented 1 month ago

TLDR

This closes an internal Shopify ticket that suspects that CreateReleaseStatusesJob was failing in creating release statuses due to a potential Gtihub API error response. In actuality, bugsnag errors are related to ConcurrentJobError. Some other open errors are linked to a similar issue in CreateOnGithubJob, primarily involving temporary spikes and some 504 response errors.

In this PR, I decided to drop duplicate background jobs for CreateReleaseStatusesJob, and retry on server errors from Github if that is indeed the case.

Further thought process

These were some recommendations I made initially: Option 1: Implement a retry mechanism for both CreateOnGithubJob and CreateReleaseStatusesJob upon receiving a 504 response.

Option 2: Schedule a periodic job to address ReleaseStatuses and CommitDeploymentStatuses with a nil github_id, ensuring that race conditions do not occur with other concurrently initiated jobs.

Option 1 seems like a good start pointing b/c it targets server errors on Github's end (including the 504 response, Octokit::GatewayTimeout) and is lower effort than Option 2; I initially attempted to target 504 only along with the existing errors but that isn't possible b/c such subclass doesn't exist. If this solution fails, Option 2 or another alternate solution can be tried in a follow up PR.

Request for reviewers

Are you against this code change? Would you recommend another solution?