GalleyBytes / terraform-operator

A Kubernetes CRD to handle terraform operations
http://tf.galleybytes.com
Apache License 2.0

Race Condition with ArgoCD deleting PVC #168

Open lallinger-tech opened 2 months ago

lallinger-tech commented 2 months ago

Hey,

I love your project, but I'm currently facing an issue when using it in combination with ArgoCD. When I delete a Terraform CR via ArgoCD, ArgoCD also instantly deletes the PVC that the terraform-operator created for the CR. Sometimes the terraform-operator instantly recreates the PVC; sometimes it gets stuck because the PVC is missing. I tried setting multiple ArgoCD annotations on the created PVC, but to no avail. This seems like an ArgoCD bug, as somebody else is having a similar issue with StatefulSets (https://github.com/argoproj/argo-cd/issues/13503). Do you have a workaround for this problem, or is this not happening for you? I tried with v0.17.0 of the terraform-operator and ArgoCD v2.10.9.

isaaguilar commented 2 months ago

It probably has to do with the ownerReference that ties the PVC to the Terraform resource in Kubernetes. When the Terraform resource is deleted, the PVC is deleted. Aside from that, the PVC should only be created if the Terraform resource still exists.

I've done manual deletions of the Terraform resource to clean up the PVC. But I have not observed the PVC getting recreated. I wonder what ArgoCD is doing differently.

Let me try to understand Argo a little better. If ArgoCD deletes a resource:

1. Does ArgoCD issue a "delete" request on the resource in the target cluster?
2. Does ArgoCD delete respect finalizers?
3. Does ArgoCD remove references manually via any kind of annotations?

For troubleshooting, can you confirm that deleting a terraform resource manually does not "recreate" the pvc?

lallinger-tech commented 2 months ago

> It probably has to do with the ownerReference that ties the PVC to the Terraform resource in Kubernetes. When the Terraform resource is deleted, the PVC is deleted. Aside from that, the PVC should only be created if the Terraform resource still exists.

Yes, ArgoCD issues a delete to every resource that is owned by the Terraform CR. Because the CR has a finalizer, it is not fully deleted until the terraform-operator removes the finalizer. As a result, the operator sometimes recreates the PVC, and sometimes throws an error that it can't find the PVC because Argo deleted it.

> I've done manual deletions of the Terraform resource to clean up the PVC. But I have not observed the PVC getting recreated. I wonder what ArgoCD is doing differently.
>
> Let me try to understand Argo a little better. If ArgoCD deletes a resource:

> 1. Does ArgoCD issue a "delete" request on the resource in the target cluster?

Yes, and also on all resources associated with the CR: PVC, CM, Secret, Pods, etc.

> 2. Does ArgoCD delete respect finalizers?

Yes, but I tried adding a finalizer to the PVC via the taskOptions script section. This leads to the PVC staying around and only entering the Terminating state, which does not solve the problem: a terminating PVC can't be bound to a pod, so the delete pods get stuck in Pending, waiting for the PVC.

> 3. Does ArgoCD remove references manually via any kind of annotations?

No

> For troubleshooting, can you confirm that deleting a terraform resource manually does not "recreate" the pvc?

Yes, deleting the Terraform CR via kubectl works just as you'd expect.
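To illustrate the answer to question 2: in Kubernetes, an object whose `metadata.deletionTimestamp` is set but which still carries finalizers stays in the API and is shown as Terminating. A minimal Python sketch of that check, operating on plain dicts shaped like `kubectl get -o json` output (the PVC name and finalizer name are hypothetical):

```python
def is_terminating(obj: dict) -> bool:
    """True once a delete has been requested for the object."""
    return obj.get("metadata", {}).get("deletionTimestamp") is not None


def blocking_finalizers(obj: dict) -> list:
    """Finalizers that still have to be removed before the object disappears."""
    meta = obj.get("metadata", {})
    return list(meta.get("finalizers", [])) if is_terminating(obj) else []


# Hypothetical PVC that was deleted while a custom finalizer was attached.
pvc = {
    "metadata": {
        "name": "example-pvc",
        "deletionTimestamp": "2024-06-01T10:00:00Z",
        "finalizers": ["example.com/keep-until-done"],
    }
}

print(is_terminating(pvc), blocking_finalizers(pvc))
```

This only shows why the PVC lingers; it does not make a terminating PVC usable by new pods.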

My hacky workaround for now is to remove the ownerReferences in the setup script for the PVC (and for the CM and Secret, as I have observed the same race condition there) and then delete the resources via kubectl as the last action in the apply-delete step.
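The exact commands used in the setup script are not shown in this thread. As a rough sketch of the first half of such a workaround, the Python helper below builds a `kubectl patch` call that removes `metadata.ownerReferences` with a JSON Patch. The resource names and namespace are hypothetical placeholders, not the names the terraform-operator actually generates.

```python
import json
import shlex


def remove_owner_refs_command(kind: str, name: str, namespace: str) -> list[str]:
    """Build a kubectl command that drops metadata.ownerReferences via JSON Patch."""
    patch = [{"op": "remove", "path": "/metadata/ownerReferences"}]
    return [
        "kubectl", "patch", kind, name,
        "--namespace", namespace,
        "--type=json",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Hypothetical resource names; substitute the ones created for your CR.
    for kind, name in [("pvc", "my-tf"), ("configmap", "my-tf"), ("secret", "my-tf")]:
        print(shlex.join(remove_owner_refs_command(kind, name, "default")))
```

A JSON Patch `remove` fails if the path does not exist, so a real script would need to tolerate resources whose ownerReferences were already removed.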

All in all, this is definitely a bug caused solely by ArgoCD and not by your work, as ArgoCD does not even honor its own annotations, which I tried to use to prevent deletion of these resources. So I'm not sure if you want to tackle this issue; I would fully understand if you didn't.

But I noticed a bug in your work that would help me out if you could fix it: https://github.com/GalleyBytes/terraform-operator/issues/169

Thanks for your work!

davhdavh commented 1 month ago

ArgoCD should only delete resources that are marked with

metadata:
  labels:
    argocd.argoproj.io/instance: xxx

and the sub-resources of the terraform-operator resource are not marked with that.

I.e., it should delete the terraform-operator resource, which triggers the finalizers on it, which in turn run the delete. Then, once the finalizer code allows it, the resource becomes deletable and the sub-resources are marked as deletable.

lallinger-tech commented 1 month ago

> ArgoCD should only delete resources that are marked with
>
> metadata:
>   labels:
>     argocd.argoproj.io/instance: xxx
>
> and the sub-resources of the terraform-operator resource are not marked with that.
>
> I.e., it should delete the terraform-operator resource, which triggers the finalizers on it, which in turn run the delete. Then, once the finalizer code allows it, the resource becomes deletable and the sub-resources are marked as deletable.

I totally agree that it SHOULD be like that, but it isn't. The sub-resources do not have the instance label (nor the tracking ID annotation), yet they still get deleted by ArgoCD because it knows the sub-resources through the ownerReferences.
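The difference between the two selection mechanisms can be sketched with a toy model. This is not ArgoCD's implementation, only an illustration on plain dicts with made-up names and UIDs: selecting by instance label finds just the CR, while walking ownerReferences also reaches the unlabelled children.

```python
INSTANCE_LABEL = "argocd.argoproj.io/instance"


def labelled(instance: str, resources: list[dict]) -> list[dict]:
    """Resources that carry the instance label with the given value."""
    return [
        r for r in resources
        if r["metadata"].get("labels", {}).get(INSTANCE_LABEL) == instance
    ]


def owned_by(root_uid: str, resources: list[dict]) -> list[dict]:
    """Resources that point at root_uid, directly or transitively, via ownerReferences."""
    owned, frontier, remaining = [], {root_uid}, list(resources)
    while frontier:
        next_frontier, still_remaining = set(), []
        for res in remaining:
            refs = res["metadata"].get("ownerReferences", [])
            if any(ref["uid"] in frontier for ref in refs):
                owned.append(res)
                next_frontier.add(res["metadata"]["uid"])
            else:
                still_remaining.append(res)
        remaining, frontier = still_remaining, next_frontier
    return owned


# Hypothetical cluster state: one labelled CR, one unlabelled PVC owned by it,
# and one unrelated ConfigMap.
resources = [
    {"kind": "Terraform", "metadata": {"name": "my-tf", "uid": "cr-1",
                                       "labels": {INSTANCE_LABEL: "my-app"}}},
    {"kind": "PersistentVolumeClaim", "metadata": {"name": "my-tf-pvc", "uid": "pvc-1",
                                                   "ownerReferences": [{"uid": "cr-1"}]}},
    {"kind": "ConfigMap", "metadata": {"name": "unrelated", "uid": "cm-1"}},
]

print([r["kind"] for r in labelled("my-app", resources)])
print([r["kind"] for r in owned_by("cr-1", resources)])
```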

davhdavh commented 1 month ago

Did you forget to set up ArgoCD to actually respect the tf run status? https://tf.galleybytes.com/docs/getting-started/argo-cd/

lallinger-tech commented 1 month ago

No, I added the Lua script and that works like a charm. Did you try to reproduce the problem?

davhdavh commented 1 month ago

No, it works perfectly for me (two screenshots were attached here). tf is in deleted state (and stuck there because the script failed, and I haven't had time to fix it), and the PVC is still alive and well.

lallinger-tech commented 1 month ago

The PVC is not still alive; it got recreated. Compare the deletion timestamp of the Terraform CR with the creation timestamp of the PVC: the moment the Terraform CR got deleted, the original PVC got deleted too, but the terraform-operator recreated the PVC instantly, hence the creation timestamp equalling the deletion timestamp. In your picture, the creation timestamp of the Terraform CR is one month earlier, and so should the creation timestamp of the PVC be.
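The timestamp comparison described above can be written down as a small heuristic. The Python sketch below assumes RFC 3339 timestamps as found in `metadata.creationTimestamp` and `metadata.deletionTimestamp`; the 5-second tolerance and the sample values are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 timestamp such as 2024-06-01T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pvc_looks_recreated(cr_deletion: str, pvc_creation: str,
                        tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """True if the PVC was created at (or after) the moment the CR was marked for deletion."""
    return parse_k8s_time(pvc_creation) >= parse_k8s_time(cr_deletion) - tolerance


# Made-up values: the CR was deleted on June 1st.
print(pvc_looks_recreated("2024-06-01T10:00:00Z", "2024-06-01T10:00:00Z"))  # recreated PVC
print(pvc_looks_recreated("2024-06-01T10:00:00Z", "2024-05-01T09:00:05Z"))  # original PVC
```

The heuristic cannot tell the two cases apart when a CR is created and deleted within the tolerance window.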