cybozu-go / accurate

Kubernetes controller for multi-tenancy. It propagates resources between namespaces accurately and allows tenant users to create/delete sub-namespaces.
https://cybozu-go.github.io/accurate/
Apache License 2.0
34 stars 5 forks source link

Opt-in allowing cascading deletes of namespaces #119

Open erikgb opened 4 months ago

erikgb commented 4 months ago

What

As a cluster-admin I want to onboard and eventually terminate tenants in our cluster by creating and eventually deleting a root namespace for the tenant. The onboarding process works like a charm, but I am not too happy about the termination process - which we had an opportunity to try out this week. Let me explain:

We use a GitOps process where tenants are onboarded via a pull request to our "tenants" project. A tenant is in this project defined by some simple resources:

So far, so good. But this week we received a request for a tenant termination. Using a modern GitOps tool like Flux, with pruning enabled, we thought that it was as simple as reverting the onboarding PR in Git. So we did that, after getting the PR approved by the tenant responsible. What we forgot to do, was to check if there were sub-namespaces defined under the tenant root namespace. After merging the tenant termination PR, Flux immediately reported an error: It got (correctly) blocked by the Accurate namespace webhook:

delete failed, errors: Namespace/blnc delete failed: admission webhook "namespace.accurate.cybozu.io" denied the request: child namespaces exist;
kustomization/flux-tenants.flux-tenants

But this error was reported only once, which kind of surprised me - as Flux is usually constantly reconciling until the actual state equals the desired state. And as I suspected, the tenant namespace was still present - including the child namespaces that we did not think about. However the Flux controller resource had no knowledge of the resources it used to control anymore, which left the tenant root namespace (including children) as orphans in our cluster. This is something we are trying hard to avoid.

After cleaning up manually, I reached out to the Flux maintainers on Slack, and you can read all the details in the CNCF Slack thread - if you are interested. TL;DR: Flux maintainers think this is an issue with Accurate, and I tend to agree now.

To fix this for future tenant terminations in our clusters, I suggest adding an opt-in allowing us to configure the Accurate namespace webhook allowing cascade namespace DELETE requests.

How

I have some ideas after looking at the code, but please let me know what you think first! I will update this description once we agree on a solution. I'll be happy to submit a PR fixing this.

Checklist

erikgb commented 4 months ago

The easiest way to implement this would be to configure an Accurate installation to allow cascading deletes. It would be better if it could be expressed in some declarative way per namespace/sub-namespace, but I fear the complexities in an implementation of this. Looking at the docs for HNC in this area, it's not clear to me how this works - even after reading it multiple times. 😄

My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default. WDYT? Would it be possible to implement without adding a lot of complexity?

yamatcha commented 3 months ago

My ideal solution would be to allow cascading deletes of namespaces in general, which is admin area anyway. And block end-users from deleting SubNamespace with children by default

I don't think it is a good idea to cascade deletions by default. Because it is breaking changes and very dangerous. It would not be complicated if the implementation were to allow cascade deletion only for the cluster administrator.