argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.98k stars 5.47k forks

Argo hook not running on auto sync only on manual sync #9830

Open OliverLeighC opened 2 years ago

OliverLeighC commented 2 years ago

Describe the bug

We have some jobs that use Argo hooks to kick off a job, but the hook only gets triggered when we run the app sync from the CLI. The app itself is auto-syncing and shows as synced in the UI (with the commit hash of the latest change), but the hook doesn't get triggered, so the job never actually runs unless we run `argocd app sync migrations-mssql-dev` in the terminal. Not sure if this is related, but I have noticed that I can't manually trigger the sync from the UI because it doesn't seem to recognize any resources (this isn't as important because I can use the CLI, but I don't know if it's indicative of some other issue).

(screenshot)

The hooks we are using are:

  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation

but we also tried `Sync` and `PostSync`, and changing the deletion policy to `HookSucceeded`. We also tried using `generateName` instead of `metadata.name`.

To Reproduce

Here is the Application (we have a couple, but they are pretty much the same, just with different images):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: migrations-mssql-dev
  namespace: argocd
spec:
  destination:
    namespace: dev-engineering
    server: https://xxxx.xxx.us-xxx.eks.amazonaws.com/
  project: default
  source:
    path: migrations-mssql
    repoURL: https://github.com/xxx/xxx.git
    targetRevision: main
    helm:
      valueFiles:
        - values.yaml
      parameters:
        - name: imageTag
          value: staged-552138452accb9a66b6f40cbd1c0e8800f62a5d1-1656015181
  syncPolicy:
    automated: {}

then under the path `migrations-mssql/Chart.yaml`:

apiVersion: v1
description: Mssql migration Helm chart for Kubernetes
name: migrations-mssql
version: 1.1.1

and a job at `migrations-mssql/templates/job.yaml`:

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}
  labels:
    app: {{ template "migrations-mssql.name" . }}
    chart: {{ template "migrations-mssql.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    metadata:
      labels:
        {{- include "migrations-mssql.selectorLabels" . | nindent 8 }}
    spec:
      restartPolicy: Never
      containers:
        - name: migrations-mssql
          image: {{ .Values.imageRegistry }}:{{ .Values.imageTag }}
          imagePullPolicy: Always
          command: ["/bin/sh","-c"]
          args: 
            - /mssql-scripting/migrate-db-all.sh $(mssqlServer) \
              $(mssqlPort) \
              $(mssqlUser) \
              $(mssqlPassword) \
              $(FlywayEnvironment)
          env:
            - name: mssqlServer
              value: {{ .Values.mssqlServer | quote }}
            - name: mssqlPort
              value: {{ .Values.mssqlPort | quote }}
            - name: FlywayEnvironment
              value: {{ .Release.Namespace }}
            - name: mssqlPassword
              valueFrom:
                secretKeyRef:
                  key: mssqlPassword
                  name: {{ .Values.envFrom.secretRef.name }}
            - name: mssqlUser
              valueFrom:
                secretKeyRef:
                  key: mssqlUser
                  name: {{ .Values.envFrom.secretRef.name }}

When the image tag on the Application is updated at `helm.parameters.value`, that triggers the auto sync: the Application gets marked as synced and points to the newest image tag, but the job never runs unless we manually sync via the CLI.

Expected behavior

Pre-sync and sync hooks should run on auto-sync and not just when manually triggering a sync.

Screenshots

The job last ran 5 days ago (when I ran the app sync command manually in the CLI):

(screenshot)

but the current sync status says the last sync was an hour ago (when the image was updated in GitHub), and the job didn't run:

(screenshot)

Version

argocd: v2.3.2+ecc2af9.dirty
  BuildDate: 2022-03-23T05:19:12Z
  GitCommit: ecc2af9dcaa12975e654cde8cbbeaffbb315f75c
  GitTreeState: dirty
  GoVersion: go1.18
  Compiler: gc
  Platform: darwin/amd64
argocd-server: v2.3.4+ac8b7df

Logs

The logs show that the sync was successful, but the job was never triggered:

time="2022-06-29T19:29:09Z" level=info msg="Update successful" application=migrations-mssql-prod
time="2022-06-29T19:29:09Z" level=info msg="Reconciliation completed" application=migrations-mssql-prod dedup_ms=0 dest-name= dest-namespace=prod-engineering dest-server="https://xxx.gr7.us-east-1.eks.amazonaws.com" diff_ms=3 fields.level=2 git_ms=66 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=305
dantastisk commented 2 years ago

I am experiencing the same.

pp185214 commented 2 years ago

This would be good for us as well, because we use PostSync hooks to run containerized tests after deployment. When we update only the version of the containerized tests (which exist solely for the PostSync hook), Argo CD will not auto sync.

schmel commented 2 years ago

Same trouble on Argo CD 2.4.13: manual sync works, but auto sync ignores the job hook. App source: Helm.

    "argocd.argoproj.io/sync-wave": "-1"
    "argocd.argoproj.io/hook": "Sync"
    "argocd.argoproj.io/hook-delete-policy": "BeforeHookCreation"

The hook only ran if the chart version changed.

Plork commented 1 year ago

Same issue here. My guess is that the job will only trigger when there are resources changed outside of the PreSync/PostSync window.

But I haven't tested this fully yet.

sharovmerk commented 1 year ago

Any idea how to solve it?

parabolic commented 1 year ago

If anyone else is still struggling with this issue, it can be worked around (at least on version 2.6.7) by adding another resource (e.g. a Deployment with 0 replicas) without any hooks. I assume that Argo CD needs other, hook-free resources so that it can do a comparison and execute the automatic sync.
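A minimal sketch of such a hook-free placeholder resource (the name and image are illustrative, not taken from this thread):

```yaml
# Hypothetical hook-free placeholder: runs no pods, but gives the app
# at least one resource that participates in the normal (non-hook) diff.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sync-placeholder   # illustrative name
spec:
  replicas: 0              # exists only to be diffed; never schedules pods
  selector:
    matchLabels:
      app: sync-placeholder
  template:
    metadata:
      labels:
        app: sync-placeholder
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```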

candonov commented 1 year ago

I tested this with 2.6.7 and it still doesn't work.

todaywasawesome commented 1 year ago

The main difference between manual and automated sync hooks is that automated sync doesn't trigger sync hooks when it's a self-heal action, so @alexmt suggested there may be an issue there.

mrsimo commented 1 year ago

We just discovered that we were hit by this. We have an Application that runs a Job that uploads some static files to a storage bucket. Sometimes the only thing that changes in a commit is this Job's image, which contains the static files, and in those cases it's not triggering a sync. We'll have to look for an alternative, but this sounds like a bug to me :/, it's highly unexpected.

cvogelsong-kamana commented 1 year ago

Still experiencing this issue with v2.6.3+e05298b.

OliverLeighC commented 1 year ago

Any word on this?

crenshaw-dev commented 1 year ago

AFAIK no core maintainer has been able to prioritize this yet. If anyone is able to take Alex's hint and investigate further, that would be helpful.

dustin-rcg commented 1 year ago

Both regular auto sync and self-healing auto sync occur when the actual state of the cluster differs from the expected state in git. The distinction is whether the difference originates in git or in the live cluster itself. I don't know how argocd determines which case occurred, but for example modifying the image tag of a resource in git and having argocd auto sync would not be considered a self-heal.

However, this is the case where argocd is failing to run the hooks for us.

Can someone explain how sync phases are intended to work when using the app-of-apps pattern? Can changing the manifests of hook-annotated resources trigger auto sync, or do only non-hook-annotated manifests trigger auto sync? Can the resources that trigger the auto sync be in a sibling app to the resources in other hook phases?

nickRiNi commented 1 year ago

Hello everyone! This workaround (crutch) works for me: create a resource that changes on every release. Store the commit SHA or CI build ID inside it; since that resource changes every time, it triggers a sync.
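In Helm terms, a sketch of such a change-on-every-build resource might look like this (the `ciBuildId` value name is hypothetical and would be passed in by your CI pipeline):

```yaml
# Hypothetical "sync trigger" ConfigMap: its data changes on every build,
# so each release produces a non-hook diff that automated sync acts on.
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-sync-trigger
data:
  ciBuildId: {{ .Values.ciBuildId | quote }}  # e.g. set per build via Helm parameters
```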

MarioIvanovTide commented 1 year ago

I noticed that a manual sync triggers the hooks only if all the application resources are marked for sync, i.e. the application itself is being synced. One possible solution is to add an `info` field in the Application manifest and change it every time there is a change in the job manifest.
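The Application spec does have an `info` list of name/value pairs; a sketch of this suggestion (the key and value here are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: migrations-mssql-dev
  namespace: argocd
spec:
  # destination/source/syncPolicy as in the original manifest above
  info:
    - name: job-revision            # illustrative key
      value: "bump-me-on-job-change" # change whenever templates/job.yaml changes
```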

markhv-code commented 1 year ago

Ran into this same issue. I will try adding a dummy ConfigMap, but I wish the auto sync worked.

emedvesApk commented 8 months ago

Are there any updates on this? I think this issue is quite critical. A lot of Helm charts rely on hooks to work as expected. With the current behaviour it's impossible to use Argo CD for on-demand automated environments, as they always require some manual intervention to trigger those hooks.

PertsevRoman commented 7 months ago

Added the Helm chart version to a ConfigMap. A little hacky, but it works :)

alen-z commented 5 months ago

+1 for raising the criticality of this thread. We're using Helm and got unexpectedly stuck at "one or more tasks are running" during sync. We have PreSync hooks. Syncing with replace, force, and prune recovers the namespace, but it still reports "one or more tasks are running" because the PostSync hook that should run the batch/Job never ran!

Terrible.

For Helm, my expectation is behavior close to the Helm CLI's hook handling on install and upgrade. This is putting us in a tough spot to justify Argo CD.

Edit: To dial down my frustration: we had a breakthrough on our side. A mutation policy was causing the issue; Argo was confused about what to do after the mutation policy kicked in. The important part was making sure the policy is deployed first; then the rest is fine.

alexvaigm commented 4 months ago

Any updates on how to solve this? I have a similar issue:

My Application points to a Helm chart. Inside the chart there is a Role defined as a hook.

kind: Role
metadata:
  annotations:
    helm.sh/hook: 'pre-install,pre-upgrade,pre-rollback'
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: '1'

When running the sync in the UI, everything is fine. When running the sync via `kubectl patch ...`, the Role is created on the first sync, removed on the second, created again on the third, and so on.

kubectl patch app my_app  --patch-file  /tmp/patch.yaml  --type merge

operation:
  sync:
    syncStrategy:
      hook:
        force: true

Any ideas how to solve this so the Role will always be there?

imranismail commented 4 months ago

Hi, for those facing a similar issue who are running Calico, please have a look at this issue: https://github.com/projectcalico/calico/issues/5715

twig1337 commented 2 months ago

Also having this issue. I'm installing the OpenTelemetry Operator Chart into my cluster. I'm using the self-signed certificate option which stores the Helm generated certificates into a Secret which is annotated with:

"helm.sh/hook": "pre-install,pre-upgrade"
"helm.sh/hook-delete-policy": "before-hook-creation"

So this means that whenever the cache invalidates (default: 24 hours), the webhooks referencing the same CA get regenerated by Helm, but the Secret doesn't get updated/recreated, and the operator stops injecting init containers 😬

tl;dr - This issue is forcing me to go install a full cert manager instance just for proof of concept work.

EDIT

I found a solution for my issue here. Using Argo's system-level configuration to ignore the fields that the application-level config didn't affect worked.

This isn't a true fix for this issue, and it will only work in situations where you can ignore fields on other resources so that the skipped hooks don't fall out of sync.
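For reference, system-level diff customization lives in the `argocd-cm` ConfigMap. A sketch for this OpenTelemetry scenario (the webhook kind and `caBundle` field path are assumptions on my part, not copied from the comment above):

```yaml
# Hypothetical system-level ignoreDifferences entry in argocd-cm.
# Applies to all Applications, unlike per-Application ignoreDifferences.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.ignoreDifferences.admissionregistration.k8s.io_MutatingWebhookConfiguration: |
    jqPathExpressions:
      - .webhooks[]?.clientConfig.caBundle  # ignore the Helm-regenerated CA bundle
```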