argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

waiting for completion of hook and hook never succeeds #6880

Open rajivml opened 2 years ago

rajivml commented 2 years ago

Hi,

We are seeing this issue quite often: app syncs get stuck on "waiting for completion of hook" and these hooks never complete.

As you can see, the application below got stuck in the secret creation phase and somehow that secret never got created.


I stripped out all unnecessary details. This is how the secret is created and used by the job:

apiVersion: v1
kind: Secret
metadata:
  name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
    helm.sh/hook-weight: "-5"
type: Opaque
data:
  xxxx

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "-4"
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      volumes:
        - name: app-settings
          configMap:
            name: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}
        - name: app-secrets
          secret:
            secretName: {{ include "xxx.fullname" . }}-migrations-{{ .Chart.AppVersion }}

kubectl -n argocd logs argocd-server-768f46f469-j98h6 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-9lxhr | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-repo-server-57bdbf899c-7xvs7 | grep xxx-migrations - No matching logs
kubectl -n argocd logs argocd-server-768f46f469-tqp8p | grep xxx-migrations - No matching logs

[testadmin@server0 ~]$ kubectl -n argocd logs argocd-application-controller-0 | grep orchestrator-migrations
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:16:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:19:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:17Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:22:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:25:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:28:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:25Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx
time="2021-08-02T02:31:26Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook /Secret/xxx-migrations-0.0.19-private4.1784494" application=xxx

Environment:

ArgoCD Version: 2.0.1

Please let me know if any other info is required.

rajivml commented 2 years ago

I terminated the app sync and re-synced it, and the sync is successful now, but this can't keep happening: if it does, our CI/CD runs and the automation we built to install apps via the ArgoCD CLI will fail.

alexmt commented 2 years ago

I suspect this is fixed by https://github.com/argoproj/argo-cd/pull/6294 . The fix is available in https://github.com/argoproj/argo-cd/releases/tag/v2.0.3 . Can you try upgrading please?

rajivml commented 2 years ago

Sure, thanks. We recently upgraded our develop branch to use 2.0.5, and this happened on our prod build, which is on 2.0.1. I will see if this reproduces on our dev branch. Thanks!

om3171991 commented 2 years ago

@alexmt - We are using the below version of ArgoCD and seeing the same issue with the Contour Helm chart. The application is waiting for the PreSync Job to complete, whereas on the cluster I can see the job has completed.

{ "Version": "v2.1.3+d855831", "BuildDate": "2021-09-29T21:51:21Z", "GitCommit": "d855831540e51d8a90b1006d2eb9f49ab1b088af", "GitTreeState": "clean", "GoVersion": "go1.16.5", "Compiler": "gc", "Platform": "linux/amd64", "KsonnetVersion": "v0.13.1", "KustomizeVersion": "v4.2.0 2021-06-30T22:49:26Z", "HelmVersion": "v3.6.0+g7f2df64", "KubectlVersion": "v0.21.0", "JsonnetVersion": "v0.17.0" }

illagrenan commented 2 years ago

I have the same problem in version 2.2.0.

pseymournutanix commented 2 years ago

I have the same problem on the 2.3.0 RC1 as well

jaydipdave commented 2 years ago

The PreSync hook, PostSync hook, and "Syncing" (while No Operation Running) are the only long pending major issues in ArgoCD at the moment.

aslamkhan-dremio commented 2 years ago

Hello. I am still seeing this in v2.2.4. The PreSync hook is scheduled, the Job starts and runs to completion, and Argo sits there spinning "Progressing" until terminated. To work around it, we terminate the op and use 'sync --strategy=apply' (disabling the hook), running our job out of band.

Kube events during the sync confirm the job success. I no longer see the job/pod (per those events) if I check the namespace directly.

LAST SEEN  TYPE    REASON            OBJECT                                                      MESSAGE
22m        Normal  Scheduled         pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Successfully assigned dcs-prodemea-ns/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l to gke-service-nap-e2-standard-8-1oj503q-5bf9adda-f9t6
22m        Normal  Pulling           pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Pulling image "gcr.io/dremio-1093/accept-release:v3"
21m        Normal  Pulled            pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Successfully pulled image "gcr.io/dremio-1093/accept-release:v3" in 48.040095979s
21m        Normal  Created           pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Created container dcs-artifact
21m        Normal  Started           pod/dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l   Started container dcs-artifact
22m        Normal  SuccessfulCreate  job/dcs-artifact-promoter0ba458b-presync-1645132298         Created pod: dcs-artifact-promoter0ba458b-presync-1645132298-hqm9l


Let me know if I can provide any diagnostics to help.

MariaJohny commented 2 years ago

We face the same issue in 2.2.5 as well.

MariaJohny commented 2 years ago

I suspect this is fixed by #6294 . The fix is available in https://github.com/argoproj/argo-cd/releases/tag/v2.0.3 . Can you try upgrading please?

Does it work with 2.0.3 or 2.2.2?

ceguimaraes commented 2 years ago

I can confirm the error was fixed on 2.0.3. We recently upgraded to 2.3.3 and we are experiencing the error again.

yuha0 commented 2 years ago

We started experiencing this issue after upgrading to 2.3.3. Before that we were on 2.2.3. I am not 100% sure but I do not recall we had any issue with 2.2.3.

warmfusion commented 2 years ago

We're seeing a similar issue on the SyncFail hook, which means we can't actually terminate the sync action.

The job doesn't exist in the target namespace, and we've tried to trick Argo by creating a job with the same name, namespace, and annotations as we'd expect to see, with a simple echo "done" action, but nothing is helping.


ArgoCD Version:

{"Version":"v2.3.4+ac8b7df","BuildDate":"2022-05-18T11:41:37Z","GitCommit":"ac8b7df9467ffcc0920b826c62c4b603a7bfed24","GitTreeState":"clean","GoVersion":"go1.17.10","Compiler":"gc","Platform":"linux/amd64","KsonnetVersion":"v0.13.1","KustomizeVersion":"v4.4.1 2021-11-11T23:36:27Z","HelmVersion":"v3.8.0+gd141386","KubectlVersion":"v0.23.1","JsonnetVersion":"v0.18.0"}

margueritepd commented 1 year ago

To add some information here, we are running into the same issue ("waiting for completion of hook" when the hook has already completed), and it happens when we are attempting to sync to a revision that is not the targetRevision for the app. When we sync an app with hooks to the same revision as the targetRevision, we do not run into this.

Argo version:
{ "Version": "v2.3.4+ac8b7df", "BuildDate": "2022-05-18T11:41:37Z", "GitCommit": "ac8b7df9467ffcc0920b826c62c4b603a7bfed24", "GitTreeState": "clean", "GoVersion": "go1.17.10", "Compiler": "gc", "Platform": "linux/amd64", "KsonnetVersion": "v0.13.1", "KustomizeVersion": "v4.4.1 2021-11-11T23:36:27Z", "HelmVersion": "v3.8.0+gd141386", "KubectlVersion": "v0.23.1", "JsonnetVersion": "v0.18.0" }

We are running 2 application-controller replicas in an HA setup as per https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/ . I have verified we do not have a leftover instance of Argo from before it used StatefulSets.

lacarvalho91 commented 1 year ago

I had a similar problem when I was configuring resource inclusions; I wrote down what happened here: https://github.com/argoproj/argo-cd/issues/10756#issuecomment-1265488657

pseymournutanix commented 1 year ago

I am still seeing this with 2.5.0-rc1

cscorley commented 1 year ago

We resolved this symptom on v2.4.12+41f54aa for Apps that had many Pods by adding a resource exclusion along these lines to our config map:

data:
  resource.exclusions: |
    - apiGroups:
        - '*'
      kinds:
        - 'Pod'
      clusters:
        - '*'

Prior to this, we would have pre-sync job hooks that never completed in the ArgoCD UI but had actually completed in Kubernetes. Sometimes invalidating the cluster cache would help Argo recognize the job was completed, but most of the time it did not.

We believe the timeouts were related to needing to enumerate an excessive number of entities, which simply could never finish before the next status refresh occurred. We do not use the ArgoCD UI to view the status of Pods, so this solution is fine for us. A bonus for us is that the UI is much more robust now as well 🙂
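
For completeness, this is roughly where that snippet lives; a sketch assuming the default argocd-cm ConfigMap in the argocd namespace, not our exact manifest:

# Sketch: argocd-cm with the Pod exclusion shown above (default install names assumed).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.exclusions: |
    - apiGroups:
        - '*'
      kinds:
        - 'Pod'
      clusters:
        - '*'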

DasJayYa commented 1 year ago

We had this issue and it was related to a customer's Job failing to initialise due to a bad secret mount. You can validate this by checking the events in the namespace the job is spun up in, to see if it is failing to create.

dejanzele commented 1 year ago

Hello Argo community :)

I am fairly familiar with ArgoCD codebase and API, and I'd happily try to repay you for building such an awesome project by trying to have a stab at this issue, if there are no objections?

pritam-acquia commented 1 year ago

Hello Argo community :)

I am fairly familiar with ArgoCD codebase and API, and I'd happily try to repay you for building such an awesome project by trying to have a stab at this issue, if there are no objections?

I would highly appreciate it!

williamcodes commented 1 year ago

I would also highly appreciate that!

vumdao commented 1 year ago

I'm seeing this issue with v2.6.1+3f143c9

linuxbsdfreak commented 1 year ago

I am also seeing this issue when installing KubeVela with ArgoCD version v2.6.1+3f143c9:

    message: >-
      waiting for completion of hook
      /ServiceAccount/kube-vela-vela-core-admission and 3 more hooks

micke commented 1 year ago

We also had this issue and it was resolved once we set ARGOCD_CONTROLLER_REPLICAS.

Instructions here: https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller

If the controller is managing too many clusters and uses too much memory then you can shard clusters across multiple controller replicas. To enable sharding increase the number of replicas in argocd-application-controller StatefulSet and repeat number of replicas in ARGOCD_CONTROLLER_REPLICAS environment variable. The strategic merge patch below demonstrates changes required to configure two controller replicas.
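
For context, a rough sketch of the kind of strategic merge patch those docs describe, assuming the default argocd-application-controller StatefulSet in the argocd namespace (two replicas here is just an example):

# Sketch only: shard clusters across two controller replicas.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "2"   # must match spec.replicas above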

boedy commented 1 year ago

I rolled back from 1.6.1 to 1.5.10. Both versions keep waiting for the completion of a hook which has already completed successfully.

I also tried @micke's recommendation (changing ARGOCD_CONTROLLER_REPLICAS from 1 to 3). Unfortunately it doesn't make a difference.

linuxbsdfreak commented 1 year ago

In my case I am only installing the application on a single cluster. This is the only application that is failing:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-vela
  annotations:
    argocd.argoproj.io/sync-wave: "10"
  finalizers:
  - resources-finalizer.argocd.argoproj.io
  namespace: argocd
spec:
  destination:
    namespace: vela-system
    name: in-cluster
  project: default
  source:
    chart: vela-core
    repoURL: https://kubevelacharts.oss-accelerate.aliyuncs.com/core
    targetRevision: 1.7.3
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
     - ApplyOutOfSyncOnly=true
     - CreateNamespace=true
     - PruneLast=true
     - ServerSideApply=true
     - Validate=true
     - Replace=true
    retry:
      limit: 30
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m0s

boedy commented 1 year ago

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec. Through Kustomize I removed this field:

patches:
  - target:
      name: pre-hook
      kind: Job
    path: patches/hook.yaml
# patches/hook.yaml
- path: "/spec/ttlSecondsAfterFinished"
  op: remove

Afterwards the chart finally went through! It's still a bug that should be addressed, I'm just sharing this for others to work around it.
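
Presumably, with ttlSecondsAfterFinished: 0 Kubernetes deletes the finished Job almost immediately, so Argo CD can lose the race and never observe the hook completing. A minimal sketch of a hook Job that drops the TTL and lets Argo CD clean up via hook-delete-policy instead (name, image, and command are placeholders):

# Sketch: hook cleanup handled by Argo CD rather than the Job TTL.
apiVersion: batch/v1
kind: Job
metadata:
  name: pre-hook
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  # no ttlSecondsAfterFinished, so the Job still exists when Argo CD
  # checks whether the hook completed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pre-hook
        image: busybox
        command: ["sh", "-c", "echo pre-sync work"]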

zfrhv commented 1 year ago

I had this problem when I had a CR whose CRD was not yet created, plus a job with a Sync hook.

So ArgoCD couldn't apply the custom resource because there was no CRD yet, and the hook started and then disappeared. I guess because Argo retries syncing the CR, it also restarts the hook somehow (btw I was using SkipDryRunOnMissingResource for the CR).

So I just made the hook PostSync. The CR kept retrying until the CRD was created, and only after the CR was successfully created did the PostSync hook start and complete successfully.
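
For reference, roughly what that ends up looking like; the group, names, and image below are placeholders rather than the actual resources involved:

# Sketch: CR that skips the dry run while its CRD is missing, plus a PostSync hook Job.
apiVersion: example.com/v1
kind: MyCustomResource
metadata:
  name: my-cr
  annotations:
    argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
---
apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-job
  annotations:
    argocd.argoproj.io/hook: PostSync
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: post-sync-job
        image: busybox
        command: ["sh", "-c", "echo runs only after a successful sync"]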

adlnc commented 1 year ago

Encountered similar behavior as described in this issue while upgrading from v2.5.12+9cd67b1 to v2.6.3+e05298b. Pre-upgrade hooks on different applications with various numbers of pods and jobs had the same symptoms: the sync operation runs forever. I have the feeling this random event appears more frequently when using the Argo CLI.

chaehni commented 1 year ago

Observing the same issue on v2.6.2. Post-Sync hook never completes even though the corresponding pod exited successfully.

Tronix117 commented 1 year ago

Same here on 2.6.4

0Bu commented 1 year ago

Same with 2.6.5; hitting terminate throws the error "Unable to terminate operation. No operation is in progress".

mjnovice commented 1 year ago

See it with 2.5 as well.

Tarasovych commented 1 year ago

Same issue with v2.7.3+e7891b8.dirty

asaf400 commented 11 months ago

Same issue in v1.3.3.7 and also in version v6.9.. This issue was opened on Aug 2, 2021, we are now at 2023, please bump this comment via emoji so I can see it in my inbox in 2042

In all seriousness, this still happens on version 2.7.7.

Whisper40 commented 11 months ago

Hello, same here on version 2.5.0. A controller restart was needed to be able to sync again.

javydekoning commented 11 months ago

Facing the same on v2.7.9 while (re)deploying https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack

turesheim commented 11 months ago

Looking at what appears to be the same issue on ArgoCD v2.6.7. I've killed all controller pods, the server, and the repo server at the same time, hoping that ArgoCD would start behaving, but to no avail. I believe the reason it started behaving like this in the first place was an ImagePullBackOff on the Job image.

asaf400 commented 11 months ago

Looking at what appears to be the same issue on ArgoCD v2.6.7. I've killed all controller pods, the server, and the repo server at the same time, hoping that ArgoCD would start behaving, but to no avail. I believe the reason it started behaving like this in the first place was an ImagePullBackOff on the Job image.

We had to completely exclude all jobs from argo cd via resource exclusion global config: https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion

And we migrated all Jobs in the repo to CronJobs with suspend: true, BUT fair warning: due to a k8s bug, CronJobs may sometimes be triggered when changing the spec, INCLUDING changing suspend: false to suspend: true - yes, it's stupid like that.

I think it's this one, but there are others as well.. https://github.com/kubernetes/kubernetes/issues/63371
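
For anyone copying that approach, a rough sketch of the suspended-CronJob shape (name, image, and schedule are placeholders), which can then be run on demand with something like kubectl create job migrations-manual-1 --from=cronjob/migrations:

# Sketch: a CronJob that never fires on its own and only serves as a template for manual runs.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: migrations
spec:
  suspend: true            # never runs on a schedule
  schedule: "0 0 1 1 *"    # required field, effectively ignored while suspended
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: migrations
            image: busybox
            command: ["sh", "-c", "echo run migrations"]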

turesheim commented 11 months ago

I ended up executing kubectl delete application <app> after fixing the reason for the ImagePullBackOff. This resulted in the stuck job disappearing, and the application resynced and is now in order. So there is a decent way to get rid of the stuck Job.

jDmacD commented 10 months ago

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec. Through Kustomize I removed this field:

patches:
  - target:
      name: pre-hook
      kind: Job
    path: patches/hook.yaml
# patches/hook.yaml
- path: "/spec/ttlSecondsAfterFinished"
  op: remove

Afterwards the chart finally went through! It's still a bug that should be addressed, I'm just sharing this for others to work around it.

@boedy You're a Saint. I've been staring at envoyproxy/gateway for two weeks.

KilJaeeun commented 7 months ago

Hello! My pre-install job is still syncing 10 hours after it finished. I was told to delete the TTL strategy, so I deleted the TTL in the pre-install job, but I'm still having the same problem. Does this mean I should change the TTL strategy on the ArgoCD side? Is this something that hasn't been resolved yet in the latest update?

cc. @boedy

I just figured out what was causing Argo to freeze on the hook. In my case the specific hook had ttlSecondsAfterFinished: 0 defined in the spec. Through Kustomize I removed this field:

patches:
  - target:
      name: pre-hook
      kind: Job
    path: patches/hook.yaml
# patches/hook.yaml
- path: "/spec/ttlSecondsAfterFinished"
  op: remove

Afterwards the chart finally went through! It's still a bug that should be addressed, I'm just sharing this for others to work around it.

imroc commented 4 months ago

Issue still exists in argocd:v2.10.1

sanketnadkarni commented 4 months ago

Observing the same randomly for a few services in v2.9.5+f943664

Lp-Francois commented 4 months ago

Observing the same issue with argocd v2.9.2+c5ea5c4

From a manifest like this job: https://github.com/aws-samples/eks-gitops-crossplane-argocd/blob/main/crossplane-complete/templates/2-aws-provider.yaml#L28

michalschott commented 3 months ago

Still happens in v2.10.2+fcf5d8c

asaf400 commented 3 months ago

Still happens in v2.10.2+fcf5d8c

Haha 🤣 , Good one 👍

huguesalary commented 2 months ago

Running into this issue on v2.10.1+a79e0ea as well

fabi-alstom commented 2 months ago

I'm using v2.11.0+f287dab and hit the same problem with every version of kube-prometheus-stack from 45.0.0 on. The last test was with the most recent one, 58.0.0, and I'm still facing the issue. Sadly, no workaround has worked for me yet.

JuniorJPDJ commented 2 months ago

The same with envoy-gateway