nazarewk opened this issue 2 years ago
I've tracked down some of the responsible code, but found nothing wrong with it:
There is also the Kubernetes-side code for our version (EKS 1.20.5):
Generally it seems like the only way this could have happened:

- the `ExternalSecret` (and by proxy a `Secret`) shouldn't have been deleted unless `PreSync` was finished
- `PreSync` completed if all `Job`s were completed
- the `Job` was just created, so it could not have been completed
- a `Job` completed if there was at least 1 `Pod` pinned to it that has finished (see the status sketch below)
- a `Pod` finished when all containers both started and finished
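For reference, "completed" here maps to the `Job`'s status: a finished `Job` carries a `Complete` (or `Failed`) condition and a non-zero `succeeded` counter, while a just-created `Job` has neither. A trimmed, illustrative status (timestamps and counts are made up, not taken from our cluster):

```yaml
# Illustrative only: roughly what a finished hook Job's status looks like.
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt          # same name as the manifest at the bottom
status:
  startTime: "2022-07-25T13:00:00Z"
  completionTime: "2022-07-25T13:01:30Z"
  succeeded: 1                     # at least one Pod finished successfully
  conditions:
    - type: Complete               # a just-created Job has no such condition yet
      status: "True"
      lastTransitionTime: "2022-07-25T13:01:30Z"
```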
Hi, can you please add the definition (YAML) of the PreSync job?
Done, you can find it collapsed at the bottom of the description.
Looks like the issue happens less frequently after moving Argo CD onto half as many instances that are twice as large, and after adjusting reservations and limits:
I reconfigured the cluster around 13:00-15:00 on Monday the 25th; there were some trailing issues until 18:00 and almost nothing after that.
Might be related to some race conditions when performance is sub-optimal?
Could it be that events are received/processed out of order?
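For context, the "reservations and limits" part was just ordinary Kubernetes resource requests/limits on the controller workload; a minimal sketch of the shape (the numbers are illustrative, not our actual values):

```yaml
# Illustrative only: resource requests/limits for the application controller,
# e.g. patched onto the argocd-application-controller StatefulSet.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi
```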
Another case of a hiccup. I suspect the Argo CD application controller might not be handling its cache properly under load? It seems like it still thinks it has the `Job` from 3 seconds before (before the deletion due to the hook deletion policy)?
Seems like this is correlated with higher CPU usage on the Application Controller:
Continuing on the previous one: there is a `patch` on the `Application` just before the deletion of the `Secret`. I don't have the content of the patch, but I'm pretty sure it could be a status update marking `PreSync` as completed.
Maybe we could clear the cache in here? https://github.com/argoproj/gitops-engine/blob/2bc3fef13e0712cf177ba6cbcfb52283f3d9ca73/pkg/sync/sync_context.go#L1085-L1112
This might be caused by the hardcoded (low) number of processors; see https://github.com/argoproj/argo-cd/pull/10458 for a fix.
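For anyone wanting to tune this themselves, the worker counts can also be raised through the controller's configuration; a minimal sketch, assuming the standard `argocd-cmd-params-cm` keys (the values are illustrative):

```yaml
# Illustrative only: bumping the application controller's processor counts.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"      # defaults are much lower
  controller.operation.processors: "25"
```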
This is heavily related to:
Note: it seems as if the issue stopped happening after we switched to rendering Helm templates ahead of time (committing the manifests to Git and using a `directory` source).
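To illustrate that workaround, a minimal sketch of an `Application` consuming pre-rendered manifests through a directory source (the repository URL and path are placeholders, not our real ones):

```yaml
# Illustrative only: pointing the Application at plain, pre-rendered YAML
# instead of a Helm source.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fa-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/rendered-manifests.git   # placeholder
    targetRevision: main
    path: apps/fa-app            # directory of manifests rendered by CI
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: pr-prnum-ns
```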
Checklist:

- [x] I've pasted the output of `argocd version`.

Describe the bug
I've noticed a weird behavior blocking our deployments every now and then (a couple of times a day), where Argo CD considers a `PreSync` hook finished even though its `Job` was just created. The `Job`'s `Pod` then hangs on the missing `Secret` (provided through an `ExternalSecret`) that was deleted because `PreSync` supposedly finished. I've asked about it on Slack at https://cloud-native.slack.com/archives/C01TSERG0KZ/p1658409808237339, but didn't find the cause.
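For context on that dependency chain, a minimal sketch of the kind of `ExternalSecret` that materializes the `Secret` the hook `Job` consumes (the API version, store name, and keys are assumptions for illustration, not our actual resource):

```yaml
# Illustrative only: an ExternalSecret producing the Secret referenced by the
# hook Job's envFrom/secretRef. Assumes the External Secrets Operator API.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: fa-app-hooks
spec:
  secretStoreRef:
    name: aws-secrets-manager              # placeholder store
    kind: ClusterSecretStore
  target:
    name: fa-app-hooks                     # Secret the Job reads via secretRef
  data:
    - secretKey: DATABASE_URL              # hypothetical key
      remoteRef:
        key: fa-app/hooks
        property: database_url
```

If this `Secret` is removed while the hook `Job`'s `Pod` is starting, the `Pod` blocks exactly as described above.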
Example timeline
Extra info:
Seems more or less related to:
To Reproduce
No idea; it happens randomly, in the ballpark of once every 10 deployments.
Expected behavior
Argo CD does not consider the `PreSync` phase finished until the `Job` has started and completed (by either success or failure).

Version

Image: `quay.io/argoproj/argocd:v2.3.3`; not sure why `argocd version` reports 2.4.4 (it's the CLI's version).

Logs
Google Sheet gathering the following information:
Job manifest
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fa-app-migrate-kt
  labels:
    app.kubernetes.io/instance: fa-app
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: "git-22eREDACTED0d5"
    helm.sh/chart: app-REDACTED
    tags.datadoghq.com/env: "pr-prnum-ns"
    tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
    tags.datadoghq.com/service: "fa-app-migrate-kt"
    tags.datadoghq.com/version: "git-22eREDACTED0d5"
    tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
    tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
    image-tag: "git-22eREDACTED0d5"
    job-type: init
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    helm.sh/hook-weight: "20"
spec:
  backoffLimit: 5
  template:
    metadata:
      labels:
        job-type: init
        app.kubernetes.io/instance: fa-app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/version: "git-22eREDACTED0d5"
        helm.sh/chart: app-REDACTED
        tags.datadoghq.com/env: "pr-prnum-ns"
        tags.datadoghq.com/fa-app-migrate-kt.env: "pr-prnum-ns"
        tags.datadoghq.com/service: "fa-app-migrate-kt"
        tags.datadoghq.com/version: "git-22eREDACTED0d5"
        tags.datadoghq.com/fa-app-migrate-kt.service: "fa-app-migrate-kt"
        tags.datadoghq.com/fa-app-migrate-kt.version: "git-22eREDACTED0d5"
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        ad.datadoghq.com/fa-app-migrate-kt-copy.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
        ad.datadoghq.com/fa-app-migrate-ktm.logs: "[{\n \"source\": \"km-jobs\"\n}]\n"
    spec:
      restartPolicy: Never
      initContainers:
        - name: fa-app-migrate-kt-copy
          image: REDACTED/fa-app:git-22eREDACTED0d5
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /kc
      containers:
        - name: fa-app-migrate-ktm
          image: "REDACTED/km:git-522c04babb111b298d6a897cf12960eb35868082"
          command: [REDACTED]
          env:
            - name: APP_NAME
              value: fa-app-migrate-kt
            - name: KT_FILE
              value: kc/kt.yaml
          envFrom:
            - configMapRef:
                name: fa-app-hooks
            - secretRef:
                name: fa-app-hooks
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 0.5
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: kc
              mountPath: /usr/src/app/kc
      nodeSelector:
      tolerations:
      volumes:
        - name: kc
          emptyDir: {}
```