argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Application stuck in deletion because of a postDelete hook #18270

Open joffreychambrin opened 2 weeks ago

joffreychambrin commented 2 weeks ago

Describe the bug

Since we upgraded to ArgoCD v2.10.0, we have a specific application that gets stuck in deletion. The only solution I have found is to manually remove the finalizers from the App.

To Reproduce

  1. Install the latest ArgoCD version on a Kubernetes cluster.
  2. Add a Helm app with URL https://charts.releases.teleport.dev, chart teleport-kube-agent, version 15.0.2.
  3. Delete the ArgoCD app.

Expected behavior

The post-delete hook job should be launched, and once it completes the app should be automatically removed from ArgoCD.

Version

argocd: v2.11.0+d3f33c0
  BuildDate: 2024-05-07T16:01:41Z
  GitCommit: d3f33c00197e7f1d16f2a73ce1aeced464b07175
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/arm64

Logs

time="2024-05-17T09:44:28Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereferencegoroutine 230 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24 +0x64
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppOperationQueueItem.func1()
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:890 +0x50
panic({0x2fb9ae0?, 0x6a9bf20?})
    /usr/local/go/src/runtime/panic.go:914 +0x218
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).executePostDeleteHooks(0x4000bee380, 0x401a639000, 0x40230d3200?, 0x4028b6af80?, 0x20?, 0x0?)
    /go/src/github.com/argoproj/argo-cd/controller/hook.go:101 +0x8f4
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).finalizeApplicationDeletion(0x4000bee380, 0x401a638c00, 0x36b3559?)
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1168 +0x560
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppOperationQueueItem(0x4000bee380)
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:926 +0x2c0
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run.func4()
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:831 +0x2c
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xae6fb19b55bdc225?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:157 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x875c9889805965b5?, {0x49d7500, 0x4001575080}, 0x1, 0x40005b8240)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:158 +0x90
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x4f1857488fcceb0a?, 0x3b9aca00, 0x0, 0xcc?, 0x231652ab9f10a9f1?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:135 +0x80
k8s.io/apimachinery/pkg/util/wait.Until(0x2997fcc5e506dda9?, 0x7c42a7c42a95252d?, 0xd8649a624c79c08a?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:92 +0x28
created by github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run in goroutine 70
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:830 +0x728
"
ChristianCiach commented 2 weeks ago

What kind of hook are you using? A Job? Looking at the code, there is something wrong with the health check for the hook. Are you using custom health checks as Lua scripts?

ChristianCiach commented 2 weeks ago

I think https://github.com/argoproj/argo-cd/blob/v2.11.0/controller/hook.go is just missing the necessary nil-checks after calling GetResourceHealth. Every other caller of GetResourceHealth (both in argo-cd and in gitops-engine) checks the returned health object for nil, but the post-delete hook controller does not. The returned health object can indeed be nil.
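To illustrate the pattern, here is a self-contained sketch of the missing guard. This is not the real argo-cd code: `HealthStatus`, `getResourceHealth`, and `hookCompleted` are simplified stand-ins I made up to mimic how gitops-engine's `health.GetResourceHealth` can return a nil status (with no error) for resource kinds that have no built-in or Lua health check.

```go
package main

import "fmt"

// HealthStatus loosely mirrors the struct returned by gitops-engine's
// health.GetResourceHealth. This is a stand-alone sketch, not the real type.
type HealthStatus struct {
	Status  string
	Message string
}

// getResourceHealth stands in for health.GetResourceHealth: for kinds
// without a registered health check (ServiceAccount, Role, RoleBinding,
// ...), it returns a nil status and no error.
func getResourceHealth(kind string) (*HealthStatus, error) {
	if kind == "Job" {
		return &HealthStatus{Status: "Healthy"}, nil
	}
	return nil, nil // no health check registered for this kind
}

// hookCompleted shows the nil-guard the post-delete hook controller is
// missing: treat a nil health object as "no opinion" instead of
// dereferencing it.
func hookCompleted(kind string) (bool, error) {
	hookHealth, err := getResourceHealth(kind)
	if err != nil {
		return false, err
	}
	if hookHealth == nil {
		// Without this check, hookHealth.Status panics with a nil
		// pointer dereference, as in the stack trace above.
		return true, nil
	}
	return hookHealth.Status == "Healthy", nil
}

func main() {
	for _, kind := range []string{"Job", "ServiceAccount", "Role"} {
		done, _ := hookCompleted(kind)
		fmt.Printf("%s completed: %v\n", kind, done)
	}
}
```

The actual fix in controller/hook.go would look similar: check the result of GetResourceHealth for nil before reading its Status field.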

ChristianCiach commented 2 weeks ago

@joffreychambrin You can probably work around the bug by adding this to your argocd-cm configmap:

  resource.customizations: |
    batch/Job:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do
              if condition.type == "Complete" and condition.status == "True" then
                hs.status = "Healthy"
                return hs
              end
              if condition.type == "Failed" and condition.status == "True" then
                hs.status = "Degraded"
                return hs
              end
            end
          end
        end
        return hs

Completely untested, and I would advise against it. The idea here is to guarantee a non-nil hookHealth object in https://github.com/argoproj/argo-cd/blob/d3f33c00197e7f1d16f2a73ce1aeced464b07175/controller/hook.go#L101 to work around the missing nil check there.

ChristianCiach commented 2 weeks ago

@joffreychambrin The unit test introduced in https://github.com/argoproj/argo-cd/pull/16595 only exercises this feature with a bare Pod, so it would probably also work if you replace your Job with a bare Pod.

ChristianCiach commented 2 weeks ago

@alexmt Maybe you want to know about this :)

joffreychambrin commented 1 week ago

@ChristianCiach Yes, the hook I am using is a Job. It is the default one from Teleport, here: https://github.com/gravitational/teleport/blob/70ba6be2eac4f8e5c275ffac6246197ef798392a/examples/chart/teleport-kube-agent/templates/delete_hook.yaml#L47

I think your analysis is right: as you can see at the link above, there are multiple resources with the post-delete hook (a ServiceAccount, a Role, a RoleBinding, and a Job), and not all of them have a health check, hence the need for a nil check on hookHealth.