argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0
1.68k stars 252 forks source link

fix: retry on unauthorized error when retrieving resources by gvk #449

Closed leoluz closed 2 years ago

leoluz commented 2 years ago

Signed-off-by: Leonardo Luz Almeida leonardo_almeida@intuit.com

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

leoluz commented 2 years ago

@jessesuen

Based on this change, you are only retrying api discovery. But couldn't the problem happen at any point in time?

This is a really good question that was intriguing me while chasing this bug. In fact for all our internal syncs (you know that we have quite a few a day), this is the only point where ArgoCD gets an Unauthorized error. @crenshaw-dev inspected the audit logs and realized that we are calling this API ~1M times a day. This is by far the most invoked request that we are sending to kube-api. I implemented a very basic cache in this PR just to avoid the same gvk to be called multiple times. We have applications with more than 500 RoleBindings in it. Previously it would call the API 500 times per sync and now it is cached. However I believe that we could be a lot more aggressive and cache this response for the entire controller life cycle. I would be super interested in hearing opinions about this. (cc @alexmt)

I'm curious if your error scenario only happen during controller startup, or if it happens even when a controller has ben running for some time. KIAM is notorious for having auth errors during initial startup of a pod. If errors happen only at that time, if you would benefit from a different approach (e.g. auth check retry at startup instead of every reconciliation).

This is not happening during controller bootstrap. It happens very sporadically actually. However ArgoCD 2.4.6 for some reason increased the occurrence of this type of error. Before we had ~3 Unauthorized a day and now we are having ~20. FYI, we already deployed a patched version with this retry in our servers and the errors are gone.

jessesuen commented 2 years ago

and realized that we are calling this API ~1M times a day

Woah! that is definitely not supposed to happen. Discovery was always intended to be cached. I'm pretty sure we used to be doing this at a package level, or we might have assumed/expected this to happen in underlying k8s client-go. If this is not happening we absolutely should be doing it.

However I believe that we could be a lot more aggressive and cache this response for the entire controller life cycle.

I believe it must refresh as CRDs are introduced and removed. Also if control plane is upgraded from underneath the controller.