argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

waiting for completion of hook and hook never succeeds #6880

Open rajivml opened 2 years ago

rajivml commented 2 years ago

Hi,

We are seeing this issue quite often: app syncs get stuck in "waiting for completion of hook" and the hooks never complete.

As you can see below, the application got stuck in the secret creation phase, and somehow that secret never got created.

[screenshot: application sync stuck waiting on the hook Secret]

Stripped out all unnecessary details. This is how the secret is created and used by the job:

apiVersion: v1
kind: Secret
metadata:
  name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    helm.sh/hook-weight: "-5"
type: Opaque
data:
  xxxx

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "-4"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      volumes:
        - name: app-settings
          configMap:
            name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
        - name: app-secrets
          secret:
            secretName: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}

kubectl -n argocd logs argocd-server-768f46f469-j98h6 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-9lxhr | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-7xvs7 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-server-768f46f469-tqp8p | grep xxx-migrations - No matching logs

[testadmin@server0 ~]$ kubectl -n argocd logs argocd-application-controller-0 | grep orchestrator-migrations
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx

Environment:

ArgoCD Version: 2.0.1

Please let me know if any other info is required.

imroc commented 2 months ago

The same happens with envoy-gateway.

@JuniorJPDJ Try:

patches:
  - target:
      name: eg-gateway-helm-certgen
      kind: Job
    patch: |
      - path: "/spec/ttlSecondsAfterFinished"
        op: remove

JuniorJPDJ commented 2 months ago

@imroc how am I supposed to use it with helm?

imroc commented 2 months ago

@JuniorJPDJ You need Kustomize. If you must use Helm, you can combine the two: use helmCharts in kustomization.yaml to include the envoy-gateway chart, and add the patches in the same kustomization.yaml.
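
A minimal sketch of that combination; the chart name, repo URL and version below are illustrative assumptions, not the real envoy-gateway chart coordinates:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: gateway-helm                # assumed chart name
    repo: https://example.com/charts  # assumed repo URL
    version: 1.0.0                    # assumed version
    releaseName: eg
    namespace: envoy-gateway-system   # assumed namespace
patches:
  # drop the TTL so the hook Job is not deleted before Argo CD sees it finish
  - target:
      name: eg-gateway-helm-certgen
      kind: Job
    patch: |
      - path: "/spec/ttlSecondsAfterFinished"
        op: remove

Note that kustomize only renders helmCharts when invoked with --enable-helm.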

JuniorJPDJ commented 2 months ago

I figured it out - envoy-gateway allows setting this parameter directly in Helm values, so there is no need for Kustomize:

      certgen:
        job:
          ttlSecondsAfterFinished: ~

Or set it to 60; that doesn't trip up Argo CD either.
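
For reference, a rough sketch of passing that value through an Argo CD Application; the Application name, repo URL, chart name and version below are assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: envoy-gateway                    # assumed Application name
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: envoy-gateway-system      # assumed target namespace
  source:
    repoURL: https://example.com/charts  # assumed Helm repo URL
    chart: gateway-helm                  # assumed chart name
    targetRevision: 1.0.0                # assumed version
    helm:
      values: |
        certgen:
          job:
            ttlSecondsAfterFinished: ~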

florentinchaussoy commented 2 months ago

FYI - present in v2.10.7 but not in v2.10.4.

Agnes4Him commented 2 months ago

I'm experiencing the same challenge in production.

Interestingly, I ran an application locally, and all the hooks - PreSync, PostSync and SyncFail - seemed to work without tweaking anything. Wondering why?

joelmccoy commented 2 months ago

FYI - Running into the same issue with longhorn on ArgoCD v2.10.1+a79e0ea.

Edit: found something relevant to the Longhorn chart - it seems they added a value that can be set to false in order to work with Argo CD.

jkleinlercher commented 2 months ago

I'm using v2.11.0+f287dab and hit the same problem with every version of kube-prometheus-stack from 45.0.0 onwards. The last test was with the most recent one, 58.0.0, and I am still facing the issue. Sadly, no workaround has worked for me yet.

Maybe a ttl > 0 is needed also in https://github.com/prometheus-community/helm-charts/blob/9c41858ac9714483638d78fb560577dc37e55875/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml#L19

going by others who changed the TTL. Maybe someone can shed some light on why changing the TTL helps in some charts?

sambonbonne commented 2 months ago

@jkleinlercher if I understand the issue, it seems that when the TTL is set to 0, ArgoCD does not have time to detect that the job succeeded and waits for it to finish indefinitely.
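
A minimal sketch of the difference being described (Job name and image are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-pre-install-hook      # placeholder name
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
spec:
  # With ttlSecondsAfterFinished: 0 the Kubernetes TTL controller deletes the
  # Job the moment it completes, possibly before Argo CD observes the
  # completion; a small positive value (or omitting the field) leaves the Job
  # around long enough for the controller to see it succeed.
  ttlSecondsAfterFinished: 60
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox              # placeholder image
          command: ["true"]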

jkleinlercher commented 2 months ago

I also asked again in https://cloud-native.slack.com/archives/C01TSERG0KZ/p1714376880925509 about this issue. I would be happy if we could get some experts in to tell us whether "ttlSecondsAfterFinished: 0" in a hook job is a problem (or under which circumstances it could be a problem). Maybe then some off-the-shelf Helm charts could be reconfigured to solve this problem for everyone.

jkleinlercher commented 2 months ago

I now realize that 'ttlSecondsAfterFinished: 0' cannot be the root cause, because it is not set in our situation. In kube-prometheus-stack there is an API condition around this setting which is not met in our current clusters:

https://github.com/prometheus-community/helm-charts/blob/9c41858ac9714483638d78fb560577dc37e55875/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml#L17-L20

So there must be another root cause for the "stuck in pending deletion" situation.
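
For illustration only, a generic sketch of that kind of guarded setting in a Helm template; this shows the general pattern, not the chart's exact condition:

{{- /* only emit the TTL when a value is set and the cluster is new enough;
       the version threshold here is an assumption for the example */}}
{{- if and .Values.ttlSecondsAfterFinished (semverCompare ">=1.23.0-0" .Capabilities.KubeVersion.Version) }}
  ttlSecondsAfterFinished: {{ .Values.ttlSecondsAfterFinished }}
{{- end }}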

jkleinlercher commented 2 months ago

Okay, locally on a k3d cluster I can recreate the problem with kube-prometheus-stack, but Argo CD is not stuck in "pending deletion" but in "running", although the job already finished and has already been deleted:


kubectl get job -n monitoring sx-kube-prometheus-stack-admission-create
Error from server (NotFound): jobs.batch "sx-kube-prometheus-stack-admission-create" not found

I wonder if ServerSideApply has something to do with it ...

slashr commented 2 months ago

Reporting the same problem on a k3s cluster, trying to install the latest version of the kube-prometheus-stack Helm chart.

jkleinlercher commented 2 months ago

Meanwhile I came across https://github.com/argoproj/argo-cd/issues/15292 and I wonder if this is the same problem. In

https://github.com/prometheus-community/helm-charts/blob/9c41858ac9714483638d78fb560577dc37e55875/charts/kube-prometheus-stack/templates/prometheus-operator/admission-webhooks/job-patch/job-createSecret.yaml#L9

the same deletion policy is used as in the issue mentioned … sadly, the referenced PR https://github.com/argoproj/gitops-engine/pull/461 was never merged. @leoluz or @nazarewk, is there any chance you could help out on this?

ptr1120 commented 2 months ago

Same problem now using Argo v2.11.0-rc3+20fd621 with current kube-prometheus-stack

NAVRockClimber commented 2 months ago

Same problem here on 5 different clusters (K3s, Kubespray & managed); on none of them can I deploy the current kube-prometheus-stack (chart v58.3.1) using Argo CD 2.10.9.

slashr commented 2 months ago

The issue has been fixed:

https://github.com/prometheus-community/helm-charts/pull/4510

j809 commented 2 months ago

The job job-createSecret.yaml does not complete syncing in 58.3.2. Setting prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished to 30 (seconds) helped me solve the problem.

prashant0085 commented 1 month ago

Still facing the issue when trying to deploy kube-prometheus-stack Helm chart version 58.6.1, where prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished is set to 60. Argo CD version: v2.9.0+9cf0c69.

jsantosa-minsait commented 1 month ago

Still facing the issue when trying to deploy kube-prometheus-stack Helm chart version 58.6.1, where prometheusOperator.admissionWebhooks.patch.ttlSecondsAfterFinished is set to 60. Argo CD version: v2.9.0+9cf0c69.

I have the same issue with the same configuration used for ttlSecondsAfterFinished; tested 60 and 30 seconds.

prometheusOperator:
  enabled: true
  admissionWebhooks:
    patch:
      enabled: true
      ttlSecondsAfterFinished: 30

prashant0085 commented 3 weeks ago

@jsantosa-minsait Have you by any chance enabled Istio sidecar injection? That causes the pod to complete the patching and creation, but the pod keeps on running.
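
For anyone hitting that variant: the usual workaround is to disable injection on the hook Job's pod template, as in the manifest at the top of this issue. Shown here on a placeholder Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-hook-job            # placeholder name
spec:
  template:
    metadata:
      annotations:
        # prevents Istio from injecting a sidecar that would keep the pod
        # Running after the main container exits
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox            # placeholder image
          command: ["true"]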

jsantosa-minsait commented 3 weeks ago

@jsantosa-minsait Have you by any chance enabled Istio sidecar injection? That causes the pod to complete the patching and creation, but the pod keeps on running.

Hi @prashant0085, no, I don't. I have Cilium installed, and Kyverno with admission controller hooks that may alter or patch resources. However, that is not the case here.

ilabrovic commented 1 week ago

Hi, experiencing this issue as well. Environment: OpenShift 4.14, Argo CD v2.10.10+9b3d0c0, PostSync hook job. Just a simple kustomization.yaml with 2 resources and a PostSync hook job, no Helm. The job actually takes about 2 minutes to complete, but Argo CD only marks the job as finished after approximately 10 minutes. Tried different settings (0, 60, 120) for ttlSecondsAfterFinished in the job spec, but no change in behaviour. Also monitored the memory and CPU usage of the Argo CD pods (controller, repo server, applicationset controller, etc.); no pod even comes close to its CPU or memory limit, so no issue there...
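
For context, a minimal sketch of the kind of PostSync hook Job being described (name and image are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-task              # placeholder name
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox            # placeholder image
          command: ["sh", "-c", "echo done"]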