cloudfoundry / korifi

Cloud Foundry on Kubernetes
Apache License 2.0

[Bug] Org deletion might fail because of `allowCascadingDeletion` flag not being set #702

Closed danail-branekov closed 2 years ago

danail-branekov commented 2 years ago

Currently we are setting the allowCascadingDeletion flag on orgs during org deletion.

There are some indications that this does not take effect immediately, and the HNC webhook may still deny the org deletion, possibly because of a stale webhook cache. We have seen this while running the e2e tests.

We should therefore probably set the flag when we create the org, so that it is already in place by the time we want to delete it.

Here is a sample of the handler logs from a local e2e failure:

        /usr/local/go/src/net/http/server.go:1930
    1.6456165260597837e+09      ERROR   Org Handler     unauthorized to delete org      {"OrgGUID": "1f8fb1e7-2d14-48b4-aa6a-38ddb0cf11dd", "error": "Org forbidden: admission webhook \"subnamespaceanchors.hnc.x-k8s.io\" denied the request: The subnamespace 1f8fb1e7-2d14-48b4-aa6a-38ddb0cf11dd is not a leaf and doesn't allow cascading deletion. Please set allowCascadingDeletion flag or make it a leaf first."}
    code.cloudfoundry.org/cf-k8s-controllers/api/apis.(*AuthAwareHandlerFuncWrapper).Wrap.func1
        /workspace/api/apis/auth_aware_handler.go:33
    net/http.HandlerFunc.ServeHTTP
        /usr/local/go/src/net/http/server.go:2047
    code.cloudfoundry.org/cf-k8s-controllers/api/apis.(*AuthenticationMiddleware).Middleware.func1
        /workspace/api/apis/authentication_middleware.go:76
    net/http.HandlerFunc.ServeHTTP
        /usr/local/go/src/net/http/server.go:2047
    github.com/gorilla/mux.(*Router).ServeHTTP
        /go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
    net/http.serverHandler.ServeHTTP
        /usr/local/go/src/net/http/server.go:2879
    net/http.(*conn).serve
        /usr/local/go/src/net/http/server.go:1930

kieron-dev commented 2 years ago

e2e tests show that this change succumbs to the HNC propagation latency problem:

  1. We create a subnamespaceanchor for the org.
  2. HNC creates the namespace and sets the anchor status to ok.
  3. We attempt to patch the hierarchyconfiguration to set `allowCascadingDeletion`.
  4. Our admin role from the root namespace has not yet been propagated to the new namespace, so the patch fails.
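
For reference, the patch in step 3 only has to flip a single field. The following is a minimal sketch of the JSON merge patch body, assuming the flag lives at `spec.allowCascadingDeletion` on the namespace's hierarchyconfiguration object. The type and function names (`hierarchyPatch`, `hierarchySpec`, `cascadingDeletionPatch`) are hypothetical and not taken from the korifi code base.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hierarchyPatch mirrors only the part of the hierarchyconfiguration object
// that the patch touches. The field path spec.allowCascadingDeletion is an
// assumption of this sketch.
type hierarchyPatch struct {
	Spec hierarchySpec `json:"spec"`
}

type hierarchySpec struct {
	AllowCascadingDeletion bool `json:"allowCascadingDeletion"`
}

// cascadingDeletionPatch returns a JSON merge patch body that turns the flag on.
func cascadingDeletionPatch() ([]byte, error) {
	return json.Marshal(hierarchyPatch{Spec: hierarchySpec{AllowCascadingDeletion: true}})
}

func main() {
	body, err := cascadingDeletionPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Whichever client sends this body still needs permission to patch the object in the new namespace, which is exactly what is missing in step 4.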

Possible solutions:

  1. Use the privileged client to patch the hierarchyconfiguration. This isn't great for kubectl users.
  2. Retry the patch with exponential backoff for a while in the org repo CreateOrg method. This feels possibly too local.
  3. Block on a generic solution to this role propagation problem, which will probably be required once #572 is merged.

@cloudfoundry/eirini, any thoughts?

gcapizzi commented 2 years ago

I would say 3, done directly as part of #572, since all sorts of things would break otherwise, right? Alternatively, go with 2 and then reuse it in #572, but that may be too complicated to coordinate between people.