argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

argocd-application-controller does not timeout when getting child resources for the resource tree #10675

Open amarjayr opened 2 years ago

amarjayr commented 2 years ago

Describe the bug

I recently ran into a case where Argo CD got stuck "refreshing" an app. The refresh never finished and the resource tree never appeared in the UI. The application controller was also OOM-killed frequently, which impacted other apps.

To Reproduce

Create an app that includes a resource which generates >100,000 child resources. In my case, I think this was a cert-manager.io/v1 Certificate generating thousands of CertificateRequests (this seems to be something that happens; see https://github.com/cert-manager/cert-manager/issues/4846#issue-1132714441).
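
To check whether a single parent is producing a runaway number of children, one option is to count child objects per owner. The following is a minimal sketch, not part of Argo CD or cert-manager: it assumes the input is a Kubernetes List object such as the JSON printed by `kubectl get certificaterequests.cert-manager.io -A -o json`, and the function name and sample data are hypothetical.

```python
from collections import Counter


def count_children_by_owner(resource_list: dict) -> Counter:
    """Count items per (owner kind, namespace, owner name).

    `resource_list` is a parsed Kubernetes List object, i.e. a dict
    with an "items" array whose entries carry metadata.ownerReferences.
    """
    counts = Counter()
    for item in resource_list.get("items", []):
        meta = item.get("metadata", {})
        namespace = meta.get("namespace", "")
        for ref in meta.get("ownerReferences", []):
            counts[(ref.get("kind"), namespace, ref.get("name"))] += 1
    return counts


# Illustrative sample only; real input would come from json.load() on kubectl output.
sample = {
    "items": [
        {"metadata": {"name": "web-tls-1", "namespace": "default",
                      "ownerReferences": [{"kind": "Certificate", "name": "web-tls"}]}},
        {"metadata": {"name": "web-tls-2", "namespace": "default",
                      "ownerReferences": [{"kind": "Certificate", "name": "web-tls"}]}},
        {"metadata": {"name": "api-tls-1", "namespace": "default",
                      "ownerReferences": [{"kind": "Certificate", "name": "api-tls"}]}},
    ]
}

if __name__ == "__main__":
    # Print owners sorted by child count, largest first.
    for owner, n in count_children_by_owner(sample).most_common():
        print(n, owner)
```

An owner with a count in the tens of thousands would be the likely candidate for cleanup or exclusion.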

I suspect #10009 has a similar root cause, but they resolved it with Resource Exclusion/Inclusion. Also maybe #4863, #3864

Expected behavior

A log warning or some sort of error surfaced in the UI. There should also be a limit on the number of resources so the application-controller doesn't OOM (which impacts other apps syncing). Ideally the warning names the offending resource.
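
The behavior requested above can be illustrated with a small sketch of a child-resource traversal that stops at a resource cap or a deadline and reports which parent was being expanded. This is not the application-controller's actual code: the names `collect_children`, `get_children`, `max_resources`, and `timeout_seconds` are hypothetical, the default values are arbitrary, and it assumes ownership forms a tree with no cycles.

```python
import time
from collections import deque


class ResourceTreeLimitExceeded(Exception):
    """Raised when the traversal hits the resource cap or the deadline."""


def collect_children(root, get_children, max_resources=10_000,
                     timeout_seconds=30.0, clock=time.monotonic):
    """Breadth-first walk from `root`, bounded in size and time.

    `get_children(parent)` is a caller-supplied function returning the
    direct children of `parent`. The error message names the parent
    being expanded so the offending resource can be identified.
    """
    deadline = clock() + timeout_seconds
    seen = [root]
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in get_children(parent):
            if clock() > deadline:
                raise ResourceTreeLimitExceeded(
                    f"timed out after {timeout_seconds}s while expanding {parent!r}")
            if len(seen) >= max_resources:
                raise ResourceTreeLimitExceeded(
                    f"more than {max_resources} resources; last expanded parent: {parent!r}")
            seen.append(child)
            queue.append(child)
    return seen
```

With a cap like this, a Certificate with a runaway number of CertificateRequests would produce an error naming that Certificate instead of exhausting memory. What the limit should be, and whether it should be configurable, remains the open question in this issue.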

Would this have been surfaced in any way through the metrics?

Obviously the more important resolution is cleaning up the child resources (which eventually would have overwhelmed etcd), but ideally Argo is able to identify edge cases like this.

Version

argocd: v2.4.9+1ba9008
  BuildDate: 2022-08-11T15:43:48Z
  GitCommit: 1ba9008536b7e61414784811c431cd8da356065e
  GitTreeState: clean
  GoVersion: go1.18.5
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.4.11+3d9e9f2
  BuildDate: 2022-08-22T09:13:16Z
  GitCommit: 3d9e9f2f95b7801b90377ecfc4073e5f0f07205b
  GitTreeState: clean
  GoVersion: go1.18.5
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.1+g5cb9af4
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

Logs

argocd-server
level=error msg="finished unary call with code Unknown" error="error getting cached app state: error getting application by query: application refresh deadline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.time_ms=60017.48 span.kind=server system=grpc

There were no relevant logs on the argocd-application-controller.

If you can give any guidance on what the (new) child resource limit should be or how it should be set, I'm happy to make a PR.

chris93111 commented 2 years ago

This issue is not resolved. After deploying 76 apps and configuring Resource Exclusion/Inclusion, the problem still appears.

prein commented 2 years ago

According to the manual, resource inclusions/exclusions are configured by manually editing the ConfigMap. Am I correct? Is it not configurable in a declarative way?

chris93111 commented 2 years ago

> According to the manual, resource inclusions/exclusions are configured by manually editing the ConfigMap. Am I correct? Is it not configurable in a declarative way?

Yes, you can with the Argo CD Operator.
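
Separately from the operator, the `argocd-cm` ConfigMap is itself a Kubernetes manifest, so it can be kept in Git and applied like any other resource. Below is a minimal sketch that generates such a manifest with a `resource.exclusions` entry. The `argocd` namespace, the choice of CertificateRequest as the excluded kind, and the helper name `build_argocd_cm` are assumptions for illustration; a real `argocd-cm` usually carries other settings that would need to be merged in.

```python
import json

# Assumption: exclude cert-manager CertificateRequests on all clusters.
# The value of resource.exclusions is a YAML string inside the ConfigMap.
RESOURCE_EXCLUSIONS = """\
- apiGroups:
  - cert-manager.io
  kinds:
  - CertificateRequest
  clusters:
  - "*"
"""


def build_argocd_cm(namespace: str = "argocd") -> dict:
    """Return an argocd-cm ConfigMap manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "argocd-cm",
            "namespace": namespace,
            # Argo CD only reads ConfigMaps carrying this part-of label.
            "labels": {
                "app.kubernetes.io/name": "argocd-cm",
                "app.kubernetes.io/part-of": "argocd",
            },
        },
        "data": {"resource.exclusions": RESOURCE_EXCLUSIONS},
    }


if __name__ == "__main__":
    # JSON is accepted by `kubectl apply -f -`, so the output can be piped directly.
    print(json.dumps(build_argocd_cm(), indent=2))
```

Note that several commenters here report the refresh problem persisting even with exclusions configured, so this addresses the configuration question rather than the underlying bug.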

mohaldu commented 1 year ago

Bump on this. We are facing similar pains when using resource exclusion to stop Argo CD from tracking Cilium.

Woytek-Polnik commented 1 year ago

same here

ChrisLanks commented 1 year ago

We are also hitting this issue. Please prioritize.

gmolaire commented 5 months ago

Same here. Bump

ostgardh commented 5 months ago

Same here. Bump. Are there any workarounds for this issue?

Casper-dss commented 4 months ago

Same error here: "error getting cached app managed resources: error getting application by query: application refresh deadline exceeded"

ivan-cai commented 4 months ago

Same error here, and applications are still stuck refreshing. Some resources cannot be displayed (such as the Endpoints for a Service or the Pods for a Deployment): msg="finished unary call with code Unknown" error="error getting cached app resource tree: error getting application by query: application refresh deadline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService


kknyxkk commented 2 months ago

Same error here, trying to understand why this happens

valkiriaaquatica commented 1 month ago

Same here. After some testing and changing timeout values in values.yaml, my solution was to move the applications to a new "light" repo. The light repo has fewer files, so in my case the problem may have been that the original "heavy" repo was too heavy.