argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

application-controller `Watch failed` #15464

Open · yellowhat opened this issue 1 year ago

yellowhat commented 1 year ago

Describe the bug

Hi, I am using the argo-cd 5.46.2 helm chart.

I have noticed that every 12 hours the application-controller throws the following error:

 retrywatcher.go:130] "Watch failed" err="context canceled"

According to this discussion, some watch permissions are missing.

Currently the role bound to the application-controller service account has watch on secrets and configmaps (a quick way to verify this is sketched after the manifests below):

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-cd-application-controller
subjects:
- kind: ServiceAccount
  name: argocd-application-controller
  namespace: argo-cd

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-cd-application-controller
  namespace: argo-cd
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - argoproj.io
  resources:
  - applications
  - appprojects
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
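
One way to sanity-check these permissions, using the names and namespace from the manifests above, is kubectl's built-in access review (a verification sketch, not a fix):

# Ask the API server whether the controller's ServiceAccount may watch
# each resource; prints "yes" or "no" per check.
kubectl auth can-i watch secrets -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller
kubectl auth can-i watch configmaps -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller
kubectl auth can-i watch applications.argoproj.io -n argo-cd \
  --as=system:serviceaccount:argo-cd:argocd-application-controller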

Is there something else missing?

To Reproduce

kubectl logs argo-cd-application-controller-0 | grep Watch

Expected behavior

No error

Version

$ argocd version
argocd: v2.8.3+77556d9
  BuildDate: 2023-09-07T16:05:43Z
  GitCommit: 77556d9e64304c27c718bb0794676713628e435e
  GitTreeState: clean
  GoVersion: go1.20.6
  Compiler: gc
  Platform: linux/amd64

Logs

E0912 08:10:34.158858       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158977       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.161448       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162382       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.158558       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0912 08:10:34.162246       7 retrywatcher.go:130] "Watch failed" err="context canceled"
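
To make the 12-hour cadence visible, the "Watch failed" timestamps can be bucketed by hour of day (a small shell sketch against the same pod as above):

# Count errors per hour; two dominant buckets twelve hours apart
# would match the pattern described in the report.
kubectl logs argo-cd-application-controller-0 \
  | grep 'Watch failed' \
  | awk '{print substr($2, 1, 2)}' \
  | sort | uniq -c
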
mimartin12 commented 10 months ago

I am experiencing the same. Every 12 hours I get about 40 or so errors that all say err="context canceled". Most of these errors show up after attempting to sync an externally managed cluster. The cluster does sync eventually, but these errors are thrown first.

Time | Host | Message
-----|------|--------
14:39:19 UTC | aks-general-00000-vmss000002-argocd | "Watch failed" err="context canceled" (×10 identical entries)
dmarquez-splunk commented 1 month ago

We are still seeing this issue in Argo CD 2.11.2, and it is causing deployment outages for some of our users. We have one installation with multiple controllers that manage 40+ clusters.

gdsoumya commented 1 month ago

This might be unrelated, but if you are using a limited RBAC role for the Argo CD application controller instead of the admin role with permissions to all resources on the cluster, you might want to either manually set resource inclusions/exclusions or use the respectRBAC feature, which lets Argo CD automatically figure out which resources it has access to and needs to monitor/watch. A sketch of both options follows the reference.

Ref:

  1. https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion
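
As a rough sketch of both options (key names as documented in the link above; the excluded API group is only a placeholder), the settings live in the argocd-cm ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argo-cd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Keep the controller from watching API groups it has no RBAC for.
  resource.exclusions: |
    - apiGroups:
      - "some-group.example.com"  # placeholder, not a real group
      kinds:
      - "*"
      clusters:
      - "*"
  # Or let the controller detect its own access and automatically skip
  # resources it cannot read; allowed values are "normal" and "strict".
  resource.respectRBAC: "normal"
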
colinodell commented 1 week ago

We are also seeing 75-200 of these log entries from each application controller every 12 hours on v2.11.3. The timing correlates with the cluster's cache age dropping to 0:

[screenshot: watch-failure error count spiking as the cluster cache age drops to 0]

Here's a zoomed-in look at a 15-minute window:

[screenshot: the same correlation over a 15-minute window]

I don't know what this correlation means but thought it might be worth sharing.
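
For anyone who wants to plot the same correlation: the application controller exports the gauge argocd_cluster_cache_age_seconds on its metrics endpoint (port 8082; the workload name and namespace below assume a stock install):

# Port-forward the controller's metrics port and pull the cache-age
# gauge that the graphs above are based on.
kubectl -n argocd port-forward statefulset/argocd-application-controller 8082:8082 &
curl -s http://localhost:8082/metrics | grep argocd_cluster_cache_age_seconds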

AurimasNav commented 5 days ago

This morning I found that the controller had been logging this error every second for the whole night:

E0918 05:28:13.841461       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:14.842231       7 retrywatcher.go:130] "Watch failed" err="context canceled"
E0918 05:28:15.842669       7 retrywatcher.go:130] "Watch failed" err="context canceled"

There are problems with my Argo CD, but this error does not help to identify the cause.