BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team. This includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3.
Apache License 2.0

Improve Warden Operator for GitHub API calls #4842

Closed. IanKWatts closed this issue 1 month ago

IanKWatts commented 1 month ago

Describe the issue
The Warden Operator makes many GitHub API calls and now often hits the API rate limit. We can improve the efficiency and performance of the operator by reducing redundant API calls, sanity-checking GitHub IDs before trying to add them for repo access, and creating additional GitHub users to spread the load.
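
As a rough illustration of the sanity check mentioned above, here is a minimal Python sketch that rejects obviously malformed IDs before any API call is spent on them. It assumes that "GitHub IDs" means GitHub usernames and that they follow GitHub's published username rules (alphanumeric characters and single hyphens, no leading or trailing hyphen, at most 39 characters). The function names are hypothetical and this is not the operator's code.

```python
import re

# Assumed format: GitHub's username rules for new accounts. Some legacy
# accounts may not follow them, so treat a failure as "needs review",
# not as proof that the account does not exist.
_LOGIN_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def looks_like_github_login(login):
    """Return True if `login` is syntactically plausible as a GitHub username."""
    return _LOGIN_RE.fullmatch(login) is not None


def split_logins(logins):
    """Partition logins into (plausible, rejected) without calling GitHub."""
    plausible = [name for name in logins if looks_like_github_login(name)]
    rejected = [name for name in logins if not looks_like_github_login(name)]
    return plausible, rejected


# Hypothetical input: only the plausible logins would be sent to the API.
plausible, rejected = split_logins(["IanKWatts", "-alice", "bob ", "carol-dev"])
```

A check like this only catches formatting mistakes. A well-formed but mistyped username still needs a lookup against GitHub.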

What is the Value/Impact?
Reduce errors and improve the effectiveness of the operator.

What is the plan? How will this get completed?

Identify any dependencies
n/a

Definition of done

IanKWatts commented 1 month ago

The main problem was that the operator called the GitHub APIs so often that we regularly hit the API rate limit, which broke the operator. The cause was the way the operator managed GitOpsTeam resources. The operator would fetch the actual state from GitHub and put that in the 'status' of the GitOpsTeam. A poorly configured GitOpsTeam (for example, one with a mistyped user ID) would therefore leave the operator constantly trying to reach an impossible state: it would repeatedly make API calls, and the actual state would never match the desired state.
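
To make the loop concrete, here is a small Python simulation of the old behaviour. It is only a sketch, not the operator's code: `GitOpsTeam`, `FakeGitHub`, the field names, and the logins are all hypothetical stand-ins, and the pass count is capped for the demo.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for GitHub: only these logins exist.
EXISTING_GITHUB_USERS = {"alice", "bob"}


@dataclass
class GitOpsTeam:
    """Simplified, hypothetical model of a GitOpsTeam resource."""
    desired_members: set
    status_members: set = field(default_factory=set)


class FakeGitHub:
    """Counts calls so the cost of each reconcile pass is visible."""

    def __init__(self):
        self.api_calls = 0
        self.members = set()

    def add_member(self, login):
        self.api_calls += 1
        if login in EXISTING_GITHUB_USERS:  # unknown logins fail, non-fatally
            self.members.add(login)

    def list_members(self):
        self.api_calls += 1
        return set(self.members)


def reconcile(team, github):
    """One pass of the old behaviour. Returns True if another pass is needed."""
    for login in team.desired_members - team.status_members:
        github.add_member(login)
    # Old behaviour: status mirrors whatever GitHub reports.
    team.status_members = github.list_members()
    return team.status_members != team.desired_members


github = FakeGitHub()
team = GitOpsTeam(desired_members={"alice", "bob", "alcie-typo"})

needs_requeue = True
for _ in range(5):  # capped for the demo; a real controller keeps requeueing
    needs_requeue = reconcile(team, github)
```

Because the mistyped login can never appear in the state fetched from GitHub, every pass ends with a mismatch and spends more API calls, which is the pattern that ran into the rate limit.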

Because these mismatches are not fatal errors, the operator now ignores the differences when a GitOpsTeam is not configured properly. Users who have an access issue should review their GitOpsTeam and ensure that all user IDs are entered correctly. The operator was changed to set the status of the resource to the desired state, so that the two match and reconciliation does not loop endlessly. For users entered under two roles, the operator checks for duplicates and uses the role with the highest permission.

We should probably also check how the operator handles GitOpsAlliances, to ensure that they are handled the same way. I'll create another ticket for that.

Also, the leader election settings were increased because of frequent pod restarts in Emerald and sometimes in other clusters. The durations are now 4 times the defaults, and this seems to have resolved those errors.

API calls and operator activity are back to what they should be: very little activity.
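
The status change and the duplicate-role handling described above can be sketched in Python as follows. This is an illustration under assumptions, not the operator's implementation: the role names and their ranking are borrowed from GitHub's repository permission levels, and `resolve_roles`, `reconcile`, and the `members` status key are hypothetical.

```python
# Assumed role names, ranked from lowest to highest permission.
ROLE_RANK = {"pull": 0, "triage": 1, "push": 2, "maintain": 3, "admin": 4}


def resolve_roles(role_members):
    """Map each login to the single highest-ranked role it is listed under."""
    resolved = {}
    for role, logins in role_members.items():
        for login in logins:
            current = resolved.get(login)
            if current is None or ROLE_RANK[role] > ROLE_RANK[current]:
                resolved[login] = role
    return resolved


def reconcile(spec_roles, status):
    """Return True if work was done, False if status already matches the spec."""
    desired = resolve_roles(spec_roles)
    if status.get("members") == desired:
        return False  # already recorded: no GitHub API calls needed
    # ... apply `desired` to GitHub here, logging non-fatal failures ...
    status["members"] = desired  # record the desired state, not the observed one
    return True


# Hypothetical GitOpsTeam spec: alice is listed under two roles.
spec = {"pull": ["alice", "bob"], "admin": ["alice"]}
status = {}
first_pass = reconcile(spec, status)
second_pass = reconcile(spec, status)
```

In this sketch the first pass does the work and records `alice` as `admin` only. The second pass finds the status equal to the desired state and returns without doing anything, even if some users could not actually be added on GitHub.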