Describe the bug
Let's say you are deploying a change that removes an old Deployment of workers that you don't need anymore. These pods have a long terminationGracePeriodSeconds and you are using ArgoCD's default deletion propagation policy of Foreground. While you deploy your changes, the worker Deployment switches to a deleting state, but you realize there is a bug in the new version and decide to roll back.
At this point, while the sync is still happening, you press the rollback button, but ArgoCD only performs a basic kubectl apply command. This unfortunately is not enough to cancel the deletion of the resource, and k8s still ends up killing your Deployment, leaving you with no workers, an incident, and postmortems to write :stuck_out_tongue_closed_eyes:
To Reproduce
Create a repo with a basic Deployment with a very long terminationGracePeriodSeconds and a preStop hook similar to
```yaml
preStop:
  exec:
    command:
      - sh
      - -c
      - "sleep 100"
```
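For reference, a minimal Deployment matching the description above might look like the sketch below (the name, image, and grace period are placeholder values, not from the original report):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      # Long enough that pods linger in Terminating during the sync.
      terminationGracePeriodSeconds: 600
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "sleep infinity"]
          lifecycle:
            preStop:
              exec:
                command:
                  - sh
                  - -c
                  - "sleep 100"
```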
Create a basic app that targets that repo with auto-sync disabled.
Push a new change renaming the deployment (Any change that forces the deployment to be removed will suffice).
Sync in ArgoCD UI with the prune option enabled.
Immediately perform a rollback operation while the pods of the Deployment are still stopping and the Deployment is in a terminating state.
Wait a few minutes and observe how your Deployment never comes back to life.
Expected behavior
Ideally I would expect there to be some k8s-native way to cancel the deletion of a resource that is being deleted. I am not aware (although I haven't looked much into it) of such a mechanism.
If no such mechanism exists, maybe ArgoCD could implement a bit of custom logic to check for resources being deleted, but it may be too cumbersome.
Maybe we should switch the default deletion propagation policy to the k8s default of Background. In that situation this would never happen, as the Deployment would immediately be deleted and recreated if need be.
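If I read the sync options documentation right, the prune propagation policy can already be overridden per resource with an annotation, so an interim workaround might be something like (resource name is a placeholder):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
  annotations:
    # Ask ArgoCD to prune this resource with background propagation
    # instead of the default foreground policy.
    argocd.argoproj.io/sync-options: PrunePropagationPolicy=background
```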
Checklist:
argocd version
Version