argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Argo CD considers `PreSync` phase finished even though the Job was just created #10077

Open nazarewk opened 2 years ago

nazarewk commented 2 years ago

Checklist:

Describe the bug

I've noticed a weird behavior blocking our deployments every now and then (a couple of times a day), where Argo CD considers the PreSync hook finished even though the Job was just created. The Job's Pod then hangs because it is missing a Secret (provided through an ExternalSecret) that was deleted once PreSync supposedly finished.

I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339, but didn't find the cause.
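For context, the ExternalSecret manifest isn't included in this issue, so the following is only a hypothetical sketch of how such a resource might be wired into the PreSync phase so that Argo CD cleans it up once it believes the phase has finished; the name, refresh interval, delete policy and secret store below are all assumptions:

```yaml
# Hypothetical sketch only - the real ExternalSecret manifest is not shown in this issue.
# The idea: the Secret consumed by the PreSync Job is produced by an ExternalSecret that
# is itself a PreSync hook, so Argo CD deletes it once it considers the phase finished.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks                                      # assumed, matching the Job's secretRef
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # assumed delete policy
spec:
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: example-store                                   # made-up store name
  target:
    name: fa-app-hooks                                    # the Secret referenced by the hook Pod
```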

Example timeline

  1. 10:36:54 Argo CD created the ExternalSecret, ESO created a Secret
  2. 10:37:02 Argo CD dropped the Job
  3. 10:37:04 Argo CD created a new Job
  4. 10:37:07 Argo CD deleted the ExternalSecret
    • the Secret was garbage collected, then ESO recreated it for a split second
  5. 14:12:05 Argo CD cleaned up the Job (I might have terminated the Sync manually)

Extra info:

Seems more or less related to:

To Reproduce

No idea; it happens randomly, in the ballpark of once every 10 deployments.

Expected behavior

Argo CD does not consider the PreSync phase finished until the Job has started and completed (either by success or by failure).
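For reference, "started and completed" here corresponds to the Job reporting a terminal condition in its status, roughly like this (values are illustrative):

```yaml
# What a finished Job looks like from the API's perspective (illustrative values)
status:
  conditions:
    - type: Complete          # or "Failed" once backoffLimit is exhausted
      status: "True"
  succeeded: 1
  startTime: "2022-07-21T10:37:04Z"
  completionTime: "2022-07-21T10:38:30Z"
```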

Version

image: quay.io/argoproj/argocd:v2.3.3 (not sure why `argocd version` reports 2.4.4; that's the CLI's version).

Logs

A Google Sheet gathering the following information:

Job manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      tolerations:
      volumes:
        - name: kc
          emptyDir: {}
```
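Note that the Job only carries Helm hook annotations (`helm.sh/hook: pre-install,pre-upgrade`), which Argo CD maps onto its PreSync phase. For anyone trying to reproduce this without Helm, a rough sketch of the equivalent native annotations (the sync-wave value mirrors the hook weight above):

```yaml
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "20"   # rough equivalent of helm.sh/hook-weight
```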
nazarewk commented 2 years ago

I've tracked down some of the responsible code, but found nothing wrong with it:

nazarewk commented 2 years ago

There is also the Kubernetes-side code for our version (EKS 1.20.5):

nazarewk commented 2 years ago

Generally it seems like the only way this could happen is if one of the following assumptions doesn't hold:

  1. Argo CD would not delete the ExternalSecret (and by proxy the Secret) unless PreSync was finished
  2. Argo CD would only consider PreSync completed once all Jobs were completed
    • the Job was just created, so it could not have been completed
  3. Kubernetes would only consider a Job completed if there was at least 1 finished Pod pinned to it
  4. Kubernetes would only consider a Pod finished when all of its containers had both started and finished (see the status sketch below)
    • this could not have happened, because the Pod was just created
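To illustrate points 3 and 4: a Pod only counts as finished once every one of its containers reports a terminated state, roughly like this (illustrative values, reusing the container names from the Job manifest above):

```yaml
# Illustrative Pod status for a finished hook Pod
status:
  phase: Succeeded
  initContainerStatuses:
    - name: fa-app-migrate-kt-copy
      state:
        terminated:
          exitCode: 0
          reason: Completed
  containerStatuses:
    - name: fa-app-migrate-ktm
      started: true
      state:
        terminated:
          exitCode: 0
          reason: Completed
```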
OmerKahani commented 2 years ago

Hi, can you please add the definition (YAML) of the PreSync job?

nazarewk commented 2 years ago

> Hi, can you please add the definition (YAML) of the PreSync job?

Done, you can find it collapsed at the bottom of the description.

nazarewk commented 2 years ago

Looks like the issue happens less frequently after putting Argo CD on half the number of twice-as-large instances and adjusting reservations and limits (screenshot: screenshot-2022-07-28_09-39-54). I reconfigured the cluster around 13:00-15:00 on Monday the 25th; there were some trailing issues until 18:00 and almost nothing after that. Might this be related to some race condition when performance is sub-optimal?
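The exact sizing isn't recorded in this issue, but the kind of adjustment described would look roughly like this patch on the application controller StatefulSet (values are placeholders, not the ones actually used):

```yaml
# Illustrative only - actual instance sizes and resource values are not given in this issue.
# Patch for the argocd-application-controller StatefulSet:
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 4Gi
```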

nazarewk commented 2 years ago

Could it be that events are received/processed out of order?

nazarewk commented 2 years ago

Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems like it thinks it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy)?

screenshot-2022-08-12_12-42-15

Seems like this is correlated with higher CPU usage on the application controller (screenshot: screenshot-2022-08-12_12-46-51).

nazarewk commented 2 years ago

Continuing from the previous comment: there is a patch on the Application just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking PreSync as completed. (screenshot: screenshot-2022-08-12_12-56-43)

nazarewk commented 2 years ago

Maybe we could clear the cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112

nazarewk commented 2 years ago

This might be caused by a hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix.
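For anyone who wants to raise the controller's worker counts without waiting for that PR: the status and operation processor counts can already be tuned via the argocd-cmd-params-cm ConfigMap. A sketch (the values are guesses, and whether this touches the same hardcoded value as the PR is not confirmed here):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"     # default is 20
  controller.operation.processors: "25"  # default is 10
```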

nazarewk commented 2 years ago

This is heavily related to:

  1. https://github.com/argoproj/argo-cd/blob/9fac0f6ae6e52d6f4978a1eaaf51fbffb9c0958a/controller/sync.go#L465-L485 - a fix is suggested there
  2. https://github.com/argoproj/argo-cd/issues/4669 - the original issue
  3. https://github.com/argoproj/argo-cd/pull/4715 - a "good-enough" fix for the issue
nazarewk commented 2 years ago

Note: it seems as if the issue stopped happening after we switched to rendering Helm templates ahead of time (committing manifests to git + using a directory source).
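For anyone looking to replicate that workaround, it roughly amounts to an Application whose source points at pre-rendered manifests (e.g. `helm template` run in CI and committed to git) instead of a Helm chart; the repo URL, path and namespace below are placeholders:

```yaml
# Sketch of the workaround: a directory source over manifests rendered ahead of time.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/REDACTED/manifests.git   # placeholder repo
    targetRevision: main
    path: rendered/fa-app                                  # placeholder path to rendered manifests
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns
  syncPolicy:
    automated: {}
```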