argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Application controller hangs and stops processing queue #18478

Open neizmirasego opened 4 months ago

neizmirasego commented 4 months ago

Checklist:

Describe the bug

The application controller randomly hangs and stops processing apps. There is not enough observability to identify the root cause. In monitoring, the workqueue depth grows from 0 to the total number of apps. Only a restart of the application controller helps. There are no error messages in the logs, even with debug mode enabled. The only symptom in the logs is that grpc.time_ms in the repo server logs increases from ~0.1 to ~2000. In the controller logs we see several "Watch failed" messages, but those also appear when the issue is not reproducible.
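
For reference, this is roughly how we confirm that the queue is backed up: a minimal Go sketch that scrapes the controller metrics endpoint and prints the workqueue depth samples. The port (8082) and the workqueue_depth metric name are assumptions based on a default install; adjust them for your setup.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Scrapes the application controller metrics endpoint and prints the
// workqueue depth samples, e.g. after something like:
//   kubectl -n argocd port-forward sts/argocd-application-controller 8082:8082
// Port 8082 and the workqueue_depth metric name are assumptions; verify
// them against your installation.
func main() {
	resp, err := http.Get("http://localhost:8082/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "workqueue_depth") {
			fmt.Println(line)
		}
	}
}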

Repo server logs:

time="2024-05-30T20:20:31Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-05-30T20:20:31Z" grpc.time_ms=0.062 span.kind=server system=grpc
time="2024-05-30T20:21:01Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-05-30T20:21:01Z" grpc.time_ms=0.111 span.kind=server system=grpc
time="2024-05-30T20:21:31Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-05-30T20:21:31Z" grpc.time_ms=0.031 span.kind=server system=grpc
time="2024-05-30T20:22:01Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-05-30T20:22:01Z" grpc.time_ms=0.037 span.kind=server system=grpc
time="2024-05-30T20:22:31Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time="2024-05-30T20:22:31Z" grpc.time_ms=0.046 span.kind=server system=grpc
time="2024-05-30T20:22:47Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:*****,TargetRevision:master,Helm:nil,Kustomize:nil,Directory:nil,Plugin:nil,Chart:,Ref:,}/da99b470d5e7fe4737262fb172951a789c43f37c"
time="2024-05-30T20:22:47Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=repository.RepoServerService grpc.start_time="2024-05-30T20:22:45Z" grpc.time_ms=2018.76 span.kind=server system=grpc
time="2024-05-30T20:22:47Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:*****,Path:environments/etbss_ocp_mdc_02_env_1/env-1/argocd_configuration/etslt-project-sd/applications/bss/etslt-smartplug-plugins,TargetRevision:master,Helm:nil,Kustomize:nil,Directory:nil,Plugin:nil,Chart:,Ref:,}/da99b470d5e7fe4737262fb172951a789c43f37c"

Application controller logs:

time="2024-05-30T20:20:27Z" level=info msg="Start watch ScrapeConfig.monitoring.coreos.com on https://172.30.0.1:443" server="https://kubernetes.default.svc"
time="2024-05-30T20:20:27Z" level=info msg="Start watch RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443" server="https://kubernetes.default.svc"
E0530 20:20:28.410168       7 retrywatcher.go:130] "Watch failed" err="unknown"
time="2024-05-30T20:20:28Z" level=info msg="Failed to watch RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443: Resyncing RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-05-30T20:20:28Z" level=info msg="Failed to watch RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443: Resyncing RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-05-30T20:20:29Z" level=info msg="Failed to watch Secret on https://172.30.0.1:443: Resyncing Secret on https://172.30.0.1:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
E0530 20:20:29.411079       7 retrywatcher.go:130] "Watch failed" err="unknown"
time="2024-05-30T20:20:29Z" level=info msg="Start watch Secret on https://172.30.0.1:443" server="https://kubernetes.default.svc"
time="2024-05-30T20:20:29Z" level=info msg="Start watch RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443" server="https://kubernetes.default.svc"
time="2024-05-30T20:20:29Z" level=info msg="Start watch RoleBinding.rbac.authorization.k8s.io on https://172.30.0.1:443" server="https://kubernetes.default.svc"
E0530 20:20:30.410490       7 retrywatcher.go:130] "Watch failed" err="unknown"
E0530 20:20:31.411185       7 retrywatcher.go:130] "Watch failed" err="unknown"
E0530 20:20:32.413151       7 retrywatcher.go:130] "Watch failed" err="unknown"
time="2024-05-30T20:20:33Z" level=debug msg="Checking if cluster https://kubernetes.default.svc with clusterShard 0 should be processed by shard 0"

To Reproduce

Nothing special: deploy Argo CD and create applications. The source is an internal self-hosted GitLab instance, and the destination is OpenShift.

Expected behavior

All applications should be processed and deployed.

Screenshots

Screenshot 2024-06-03 133808

Version

argocd: v2.11.2+25f7504
  BuildDate: 2024-05-23T13:32:13Z
  GitCommit: 25f7504ecc198e7d7fdc055fdb83ae50eee5edd0
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/amd64
haooliveira84 commented 3 months ago

I've reported the same issue here :(

neizmirasego commented 3 months ago

We have enabled more debug logging via the argocd-cmd-params-cm ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  labels:
    app.kubernetes.io/name: argocd-cmd-params-cm
    app.kubernetes.io/part-of: argocd
data:
  controller.log.level: "debug"

and increased the glog verbosity on the application controller container:

      terminationGracePeriodSeconds: 30
      serviceAccountName: argocd-application-controller
      containers:
      - args:
        - /usr/local/bin/argocd-application-controller
        - --gloglevel
        - "4"

and found additional log messages that do not appear on a healthy cluster:

"Failed to get event! Re-creating the watcher." resourceVersion="301181841"
Restarting RetryWatcher at RV="301181583"
"Watch failed" err="unknown"

Both come from the Kubernetes Go client (client-go).
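
For context, the sketch below shows how client-go's RetryWatcher (tools/watch/retrywatcher.go, the file the "Watch failed" line points at) is typically driven. This is only an illustration of the client-go API, using Secrets as an example, not Argo CD's actual cluster cache code:

package watchexample

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchSecrets illustrates the RetryWatcher pattern: it resumes the watch
// from the last known resourceVersion whenever the underlying watch drops,
// and logs "Watch failed" (retrywatcher.go) before each retry.
func watchSecrets(config *rest.Config, namespace, resourceVersion string) error {
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	rw, err := watchtools.NewRetryWatcher(resourceVersion, &cache.ListWatch{
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return clientset.CoreV1().Secrets(namespace).Watch(context.TODO(), opts)
		},
	})
	if err != nil {
		return err
	}
	defer rw.Stop()

	// Events keep flowing across reconnects; if the watcher cannot recover
	// (e.g. the resourceVersion is too old), the channel is closed instead.
	for ev := range rw.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
	return nil
}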

jenna-foghorn commented 3 months ago

Same/similar issue: https://github.com/argoproj/argo-cd/issues/18467