argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Argo CD considers `PreSync` phase finished even though the Job was just created #10077

Open nazarewk opened 2 years ago

nazarewk commented 2 years ago

Checklist:

Describe the bug

I've noticed a weird behavior blocking our deployments every now and then (a couple of times a day), where Argo CD considers the PreSync hook finished even though the Job was just created. The Job's Pod then hangs because it is missing a Secret (provided through an ExternalSecret) that was deleted once PreSync supposedly finished.

I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339, but didn't find the cause.
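For context, the ExternalSecret manifest isn't included in this issue, so the following is only a hypothetical sketch of how such a resource might be wired into the PreSync phase so that Argo CD cleans it up once it believes the phase has finished; the name, refresh interval, delete policy and secret store below are all assumptions:

```yaml
# Hypothetical sketch only - the real ExternalSecret manifest is not shown in this issue.
# The idea: the Secret consumed by the PreSync Job is produced by an ExternalSecret that
# is itself a PreSync hook, so Argo CD deletes it once it considers the phase finished.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks                                      # assumed, matching the Job's secretRef
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded  # assumed delete policy
spec:
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: example-store                                   # made-up store name
  target:
    name: fa-app-hooks                                    # the Secret referenced by the hook Pod
```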

Example timeline

  1. 10:36:54 Argo CD created the ExternalSecret, ESO created a Secret
  2. 10:37:02 Argo CD dropped the Job
  3. 10:37:04 Argo CD created a new Job
  4. 10:37:07 Argo CD deleted the ExternalSecret
    • the Secret was garbage collected, then ESO recreated it for a split second
  5. 14:12:05 Argo CD cleaned up the Job (I might have terminated the Sync manually)

Extra info:

Seems more or less related to:

To Reproduce

No idea; it happens randomly, in the ballpark of once every 10 deployments.

Expected behavior

Argo CD does not consider the PreSync phase finished until the Job has started and completed (either by success or by failure).
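For reference, "started and completed" here corresponds to the Job reporting a terminal condition in its status, roughly like this (values are illustrative):

```yaml
# What a finished Job looks like from the API's perspective (illustrative values)
status:
  conditions:
    - type: Complete          # or "Failed" once backoffLimit is exhausted
      status: "True"
  succeeded: 1
  startTime: "2022-07-21T10:37:04Z"
  completionTime: "2022-07-21T10:38:30Z"
```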

Version

image: quay.io/argoproj/argocd:v2.3.3 (not sure why `argocd version` reports 2.4.4; that's the CLI's version).

Logs

A Google Sheet gathering the following information:

Job manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      tolerations:
      volumes:
        - name: kc
          emptyDir: {}
```
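Note that the Job only carries Helm hook annotations (`helm.sh/hook: pre-install,pre-upgrade`), which Argo CD maps onto its PreSync phase. For anyone trying to reproduce this without Helm, a rough sketch of the equivalent native annotations (the sync-wave value mirrors the hook weight above):

```yaml
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "20"   # rough equivalent of helm.sh/hook-weight
```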
nazarewk commented 2 years ago

I've tracked down some of the responsible code, but found nothing wrong with it:

nazarewk commented 2 years ago

There is also the Kubernetes-side code for our version (EKS 1.20.5):

nazarewk commented 2 years ago

Generally it seems like the only way this could happen is if one of the following assumptions doesn't hold:

  1. Argo CD would not delete the ExternalSecret (and by proxy the Secret) unless PreSync was finished
  2. Argo CD would only consider PreSync completed once all Jobs were completed
    • the Job was just created, so it could not have been completed
  3. Kubernetes would only consider a Job completed if there was at least 1 finished Pod pinned to it
  4. Kubernetes would only consider a Pod finished when all of its containers had both started and finished (see the status sketch below)
    • this could not have happened, because the Pod was just created
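To illustrate points 3 and 4: a Pod only counts as finished once every one of its containers reports a terminated state, roughly like this (illustrative values, reusing the container names from the Job manifest above):

```yaml
# Illustrative Pod status for a finished hook Pod
status:
  phase: Succeeded
  initContainerStatuses:
    - name: fa-app-migrate-kt-copy
      state:
        terminated:
          exitCode: 0
          reason: Completed
  containerStatuses:
    - name: fa-app-migrate-ktm
      started: true
      state:
        terminated:
          exitCode: 0
          reason: Completed
```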
OmerKahani commented 2 years ago

Hi, can you please add the definition (YAML) of the PreSync job?

nazarewk commented 2 years ago

> Hi, can you please add the definition (YAML) of the PreSync job?

Done, you can find it collapsed at the bottom of the description.

nazarewk commented 2 years ago

Looks like the issue happens less frequently after putting Argo CD on half the number of twice-as-large instances and adjusting reservations and limits (screenshot: screenshot-2022-07-28_09-39-54). I reconfigured the cluster around 13:00-15:00 on Monday the 25th; there were some trailing issues until 18:00 and almost nothing after that. Might this be related to some race condition when performance is sub-optimal?
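The exact sizing isn't recorded in this issue, but the kind of adjustment described would look roughly like this patch on the application controller StatefulSet (values are placeholders, not the ones actually used):

```yaml
# Illustrative only - actual instance sizes and resource values are not given in this issue.
# Patch for the argocd-application-controller StatefulSet:
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 4Gi
```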

nazarewk commented 2 years ago

Could it be that events are received/processed out of order?

nazarewk commented 2 years ago

Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems like it thinks it still has the Job from 3 seconds earlier (from before the deletion due to the hook deletion policy)?

screenshot-2022-08-12_12-42-15

Seems like this is correlated with higher CPU usage on the application controller (screenshot: screenshot-2022-08-12_12-46-51).

nazarewk commented 2 years ago

Continuing from the previous comment: there is a patch on the Application just before the deletion of the Secret. I don't have the content of the patch, but I'm pretty sure it could be the status update marking PreSync as completed. (screenshot: screenshot-2022-08-12_12-56-43)

nazarewk commented 2 years ago

Maybe we could clear the cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112

nazarewk commented 2 years ago

This might be caused by a hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix.
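For anyone who wants to raise the controller's worker counts without waiting for that PR: the status and operation processor counts can already be tuned via the argocd-cmd-params-cm ConfigMap. A sketch (the values are guesses, and whether this touches the same hardcoded value as the PR is not confirmed here):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"     # default is 20
  controller.operation.processors: "25"  # default is 10
```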

nazarewk commented 2 years ago

This is heavily related to:

  1. https://github.com/argoproj/argo-cd/blob/9fac0f6ae6e52d6f4978a1eaaf51fbffb9c0958a/controller/sync.go#L465-L485 - a fix is suggested there
  2. https://github.com/argoproj/argo-cd/issues/4669 - the original issue
  3. https://github.com/argoproj/argo-cd/pull/4715 - a "good-enough" fix for the issue
nazarewk commented 2 years ago

Note: it seems as if the issue stopped happening after we switched to rendering Helm templates ahead of time (committing manifests to git + using a directory source).
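For anyone looking to replicate that workaround, it roughly amounts to an Application whose source points at pre-rendered manifests (e.g. `helm template` run in CI and committed to git) instead of a Helm chart; the repo URL, path and namespace below are placeholders:

```yaml
# Sketch of the workaround: a directory source over manifests rendered ahead of time.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/REDACTED/manifests.git   # placeholder repo
    targetRevision: main
    path: rendered/fa-app                                  # placeholder path to rendered manifests
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns
  syncPolicy:
    automated: {}
```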