carvel-dev / kapp-controller

Continuous delivery and package management for Kubernetes.
https://carvel.dev/kapp-controller
Apache License 2.0
267 stars 102 forks source link

App gets stuck deleting if service account is not created #114

Open benmoss opened 3 years ago

benmoss commented 3 years ago

What steps did you take: Installed the "simple-app" from the instructions without first creating the service account for it. https://github.com/vmware-tanzu/carvel-kapp-controller/blob/develop/docs/walkthrough.md

Tried to delete the app via kapp -a simple-app delete and see it hang/timeout.

What happened:

$ k get app
NAME         DESCRIPTION                                                                                         SINCE-DEPLOY   AGE
simple-app   Delete failed: Preparing kapp: Getting service account: serviceaccounts "default-ns-sa" not found   3s             56m

What did you expect: Delete should succeed

Anything else you would like to add:

Environment:

danielhelfand commented 3 years ago

I have also run into the issue described above in a similar scenario where I have used serviceAccount instead of serviceAccountName. This is definitely something we should look into prioritizing.

Could be similar to #69 but can dig a bit further to see if the issue itself has different nuance to it.

vibhas commented 3 years ago

Related to #71

aaronshurley commented 3 years ago

Adding Explore to define/propose a solution.

Reasonable solutions could be:

ewrenn8 commented 3 years ago

related: #34

ewrenn8 commented 3 years ago

Hey @benmoss,

After some discussion, we've decided the best thing we could do here is to update the error message with a bit more info. Since kapp-controller will be deleting resources, we need a service account that has the roles needed to do the deletion, so that we know it is allowed from an rbac standpoint. This means if the service account can't be found, kapp-controller won't be able to delete things safely.

Going forward, the easiest remedy in this case is to create the service account again. We will be sure to call this information out in the error message when we update it.

Thanks! Eli

danielhelfand commented 3 years ago

We will also explicitly document how to resolve this issue in kapp-controller docs.

As part of documenting how to address this issue, something we will need to call out is what are minimum rbac permissions needed for the serviceaccount to resolve this issue so a user can proceed with deleting the App.

benmoss commented 3 years ago

If I had to guess why it's not straightforward:

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain. You might also be in a situation where the service account existed, resources were created, and then the service account was deleted, and so knowing the difference would mean caching that state of "i created x, y, z" on the app itself?

cppforlife commented 3 years ago

Since it never had the service account in the first place it didn't create anything, but the tricky bit is knowing that for certain.

ah, this is a good case to address.

ewrenn8 commented 3 years ago

I will update this issue to call out fixing that specific case and create another one for improving docs or error messages for the case resources do exist.

Problem statement

Kapp controller currently doesn't distinguish between deletes that will delete resources and deletes that just require deleting the App CR. This means, when a service account is missing, kctrl refuses to perform a delete in both cases, even though the former is the only one that requires a service account.

Desired State

kapp controller is able to delete apps that have not had any resources deployed without a service account being present.

Use Case

  1. Deploy kapp-controller
  2. Skip RBAC setup
  3. Deploy simple app example
  4. Verify deploy failed before applying resources
  5. delete the simple app
  6. see that it deletes successfully
julian-hj commented 1 year ago

We saw another version of this bug when trying to delete a namespace that contains a kapp app and a service account that it refers to.

The whole mess hangs forever after the service account gets deleted before the kapp finalizer can run.

heyjcollins commented 1 year ago

+1 to @julian-hj ... Can this be bumped in priority? This is a serious blocker for customers working at scale. It's not going to be feasible at scale for operators to manually inspect each namespace and delete all kaap apps before deleting the namespace itself.

evankanderson commented 1 year ago

One solution here might be to add the finalizer (which is what's blocking deletion) only after verifying that the service account exists and has access on the first reconciliation.

In combination with #1132, this could introduce the following model:

† Note that adding this ownerReference may change an App from being an independent object to an object subject to garbage collection, i.e. if the ServiceAccount is deleted. If the App is part of some larger constellation like a PackageInstall, it will have two ownerReferences, and both will need to be deleted.

ryanjbaxter commented 1 year ago

What is the best way to resolve this once you get into this stat? You can't recreate the service account because the namespace is Terminating so I am not sure how you actually delete the apps to allow the namespace to be deleted.

praveenrewar commented 10 months ago

@ryanjbaxter If you are already in this state, then you can remove the finalizers on the App CR. This might have side effect, if there were any resources created outside the namespace by the App, then those would still remain.

ragaskar commented 7 months ago

So ... I got this wedged pretty good just now when a tanzu package install of cert manager failed on a vCenter 8 TKG 2.2 cluster. Not totally sure that is supposed to work -- but I couldn't edit the finalizers on the App CR because kubectl claimed it couldn't find the app (even though it was letting me edit it). I finally had to issue a "create namespace cert-manager" and THEN kubectl let me edit the app and remove the finalizers. I also had to remove some remaining cert manager service account stuff (clusterrole, maybe some other cluster-scoped stuff) by hand. n.b., you have to "replace" finalizers with an empty array -- just deleting the key will leave the original value there and it will stay stuck.

I'm fairly new to k8s, but this is definitely one of the trickier situations I've had to try to get myself out of. Any additional pointers or automated things that could have been done would have helped me.

EDIT: by the way, for others experiencing this problem, I'll share my root issue as a possible lead (as someone new to k8s, a hint like this would've help me a lot). Deep under the hood, (i.e., PackageInstall -> App -> ... -> ReplicaSet -> ) pods were failing to spin up in the cert-manager namespace due to lack of permissions (the namespace did not allow privileged pods). Instead of trying to "delete" my way out at the namespace level, I gather the "right" way to fix would have been to update the namespace to allow privileged pods (depends on your k8s version, but for me it required adding a label) and then retrying the pod deploy. I think this would have happened automatically on some cadence, but in my case, I was able to speed it up by deleting the failing ReplicaSets, which were then immediately re-created by the parent App and then, since the namespace now had the right permissions, the pods deployed within seconds. Again, being new to k8s, deleting this resource to "retry" seemed pretty scary to me (am I going to break something?), but might be an obvious step to someone more experienced. Anything the tool could have done to give me more hints around figuring this out would have been helpful. e.g., maybe errors from pods could be rolled up somehow and displayed at time of failure vs. me having to "know" where to look for the error I need to fix (some of this isn't really in kapp's domain, to be fair).