argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

The application gets stuck in the Sync state if some resource fails to sync when there is `BeforeHookCreation` set on certain jobs #16446

Open sklgromek opened 1 year ago

sklgromek commented 1 year ago


Describe the bug

When a sync fails because it attempts to change read-only fields on a resource, or because a spec contains an error, the application gets stuck in the Syncing state, and Jobs with the BeforeHookCreation delete policy show a "Pending deletion" status. The application can remain in this state for days until someone manually terminates the sync.

To Reproduce

Deploy an application with the following specs:

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- main.yaml

main.yaml

---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: www
  labels:
    service: www
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /var/tmp/web
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job123
  annotations:
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/hook: Sync
spec:
  template:
    metadata:
      labels:
        app: job123
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        command: ["sleep", "10s"]
      restartPolicy: Never
  backoffLimit: 0

application.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-app
  namespace: argocd
spec:
  destination:
    namespace: default
    server: https://kubernetes.default.svc
  project: default
  source:
    path: test-argo
    repoURL: https://github.com/sklgromek/argo-sync-fail
    targetRevision: main
  syncPolicy:
    automated: {}

With these specs, the initial sync should succeed.

Next, change a field in the StatefulSet that cannot be updated in place, such as the size of the volume claim. After that change, the application stays stuck in the syncing state indefinitely.
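For illustration, the change in that step might look like the fragment below. This is a sketch only: the report does not state the new size, so `2Gi` is an assumed value. Per the error in the logs, any update to a StatefulSet spec field other than `replicas`, `template`, `updateStrategy`, `persistentVolumeClaimRetentionPolicy`, and `minReadySeconds` should be rejected the same way.

```yaml
# Fragment of the StatefulSet "web" in main.yaml (this key sits under spec:).
# Only the storage request differs from the original manifest.
volumeClaimTemplates:
- metadata:
    name: www
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 2Gi  # assumed example value; the original manifest requests 1Gi
```

Committing this to the tracked branch lets the automated sync policy pick it up, at which point the StatefulSet update fails while the `job123` hook is still scheduled.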

Repository with the above specs: https://github.com/sklgromek/argo-sync-fail

Expected behavior

The sync process should stop when it fails, without trying to run hooks.

Screenshots

Screenshot from 2023-11-24 11-59-10

Version

{
    "Version": "v2.9.2+c5ea5c4",
    "BuildDate": "2023-11-20T17:18:26Z",
    "GitCommit": "c5ea5c4df52943a6fff6c0be181fde5358970304",
    "GitTreeState": "clean",
    "GoVersion": "go1.21.3",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.2.1 2023-10-19T20:13:51Z",
    "HelmVersion": "v3.13.2+g2a2fb3b",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.20.0"
}

I also see the same behavior on this version:

{
    "Version": "v2.8.0+804d4b8",
    "BuildDate": "2023-08-07T14:25:33Z",
    "GitCommit": "804d4b8ca6bc4c2cf02c5c971aa923ec5b8623f0",
    "GitTreeState": "clean",
    "GoVersion": "go1.20.6",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.1.0 2023-06-19T16:58:18Z",
    "HelmVersion": "v3.12.1+gf32a527",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.20.0"
}

Logs

time="2023-11-24T12:03:45Z" level=info msg=Syncing application=argocd/test-app skipHooks=false started=true syncId=00368-JiaOG
time="2023-11-24T12:03:45Z" level=info msg=Tasks application=argocd/test-app syncId=00368-JiaOG tasks="[Sync/0 resource /PersistentVolume:default/www obj->obj (Synced,Succeeded,persistentvolume/www unchanged), Sync/0 resource /Service:default/nginx obj->obj (Synced,Succeeded,service/nginx unchanged), Sync/0 resource apps/StatefulSet:default/web obj->obj (SyncFailed,Failed,StatefulSet.apps \"web\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden), Sync/0 hook batch/Job:default/job123 nil->obj (,Running,Pending deletion)]"
time="2023-11-24T12:03:45Z" level=info msg="sync/terminate complete" application=argocd/test-app duration=5.056419ms syncId=00368-JiaOG
time="2023-11-24T12:03:45Z" level=info msg="No operation updates necessary to 'argocd/test-app'. Skipping patch" appNamespace=argocd application=test-app project=default
time="2023-11-24T12:03:45Z" level=info msg="getRepoObjs stats" application=argocd/test-app build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=294 unmarshal_ms=293 version_ms=0
time="2023-11-24T12:03:45Z" level=info msg="Skipping auto-sync: another operation is in progress" application=argocd/test-app
time="2023-11-24T12:03:45Z" level=info msg="getRepoObjs stats" application=argocd/test-app-2 build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=330 unmarshal_ms=329 version_ms=0
time="2023-11-24T12:03:45Z" level=info msg="Update successful" application=argocd/test-app
time="2023-11-24T12:03:45Z" level=info msg="Reconciliation completed" application=argocd/test-app dedup_ms=0 dest-name= dest-namespace=default dest-server="https://kubernetes.default.svc" diff_ms=10 fields.level=2 git_ms=294 health_ms=0 live_ms=0 patch_ms=21 setop_ms=0 settings_ms=0 sync_ms=0 time_ms=330
time="2023-11-24T12:03:45Z" level=info msg="Resuming in-progress operation. phase: Running, message: waiting for completion of hook batch/Job/job123" application=argocd/test-app
time="2023-11-24T12:03:45Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: default)" application=argocd/test-app
time="2023-11-24T12:03:45Z" level=info msg="getRepoObjs stats" application=argocd/test-app build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=2 unmarshal_ms=2 version_ms=0

andrii-korotkov-verkada commented 2 weeks ago

ArgoCD versions 2.10 and below have reached EOL. Can you upgrade and let us know if the issue is still present, please?

zzJinux commented 5 days ago

I'm experiencing the same issue. The version is v2.11.5+c4b283c.