cloudfoundry / korifi

Cloud Foundry on Kubernetes
Apache License 2.0

[Bug]: Resource name leases can get orphaned #1718

Open tcdowney opened 2 years ago

tcdowney commented 2 years ago

What happened?

@dsboulder ran into this while trying to apply CFOrg resources. Presumably an org with the display name "development" existed at some point in the past but had since been deleted. Its lease remained, however, so a new org with that name could not be created.

He used kapp to declaratively apply the orgs so the output looks a little different than normal, but the error should be unrelated to that.

kapp deploy --app korifi-orgs -f - -y
Target cluster '<REDACTED>' (nodes: <REDACTED>, 3+)

Changes

Namespace  Name                Kind   Conds.  Age  Op      Op st.  Wait to    Rs  Ri  
cf         cf-org-org1         CFOrg  -       -    create  -       reconcile  -   -  
^          cf-org-development  CFOrg  -       -    create  -       reconcile  -   -  
^          cf-org-org2         CFOrg  -       -    create  -       reconcile  -   -  
^          cf-org-org3         CFOrg  -       -    create  -       reconcile  -   -  

Op:      4 create, 0 delete, 0 update, 0 noop, 0 exists
Wait to: 4 reconcile, 0 delete, 0 noop

8:31:13PM: ---- applying 4 changes [0/4 done] ----
8:31:13PM: create cforg/cf-org-development (korifi.cloudfoundry.org/v1alpha1) namespace: cf

kapp: Error: Applying create cforg/cf-org-development (korifi.cloudfoundry.org/v1alpha1) namespace: cf:
  Creating resource cforg/cf-org-development (korifi.cloudfoundry.org/v1alpha1) namespace: cf:
    API server says: admission webhook "vcforg.korifi.cloudfoundry.org" denied the request: {"validationErrorType":"DuplicateNameError","message":"Organization 'development' already exists."} (reason: {"validationErrorType":"DuplicateNameError","message":"Organization 'development' already exists."})

What you expected to happen

I expected there not to be an orphaned lease.

Acceptance Criteria

GIVEN a resource with a given displayName does not exist on the cluster
WHEN I try to create a resource with that displayName
THEN I see the creation succeed and I do not see any DuplicateNameError claiming the resource exists

How to reproduce it (as minimally and precisely as possible)

Unclear how this happened. It's possible there is a race condition in the DuplicateValidator webhook or that updating the lease failed somehow during deletion.

Anything else we need to know?

If we pick up this bug we need to make sure that the fix isn't worse than the bug. Sometimes failures to update the lease are acceptable during deletion (such as when the lease does not exist) and we don't want to break uninstallation somehow.
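
To make the trade-off above concrete, here is a minimal sketch of a deletion path that tolerates a missing lease but still surfaces other failures. It is not Korifi's implementation: LeaseStore, ErrLeaseNotFound and ReleaseOnDelete are hypothetical names, and an in-memory map stands in for whatever the webhook really uses to track names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrLeaseNotFound is a hypothetical sentinel error for a missing lease.
var ErrLeaseNotFound = errors.New("lease not found")

// LeaseStore is a hypothetical in-memory stand-in for the name registry:
// it maps a display name to the UID of the resource that owns it.
type LeaseStore struct {
	mu     sync.Mutex
	owners map[string]string
}

func NewLeaseStore() *LeaseStore {
	return &LeaseStore{owners: map[string]string{}}
}

// Acquire claims name for ownerUID and fails if another owner holds it.
func (s *LeaseStore) Acquire(name, ownerUID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.owners[name]; ok && cur != ownerUID {
		return fmt.Errorf("name %q already exists", name)
	}
	s.owners[name] = ownerUID
	return nil
}

// Release drops the lease. It returns ErrLeaseNotFound if there is none
// and a different error if the lease belongs to another owner.
func (s *LeaseStore) Release(name, ownerUID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.owners[name]
	if !ok {
		return ErrLeaseNotFound
	}
	if cur != ownerUID {
		return fmt.Errorf("lease %q is held by another owner", name)
	}
	delete(s.owners, name)
	return nil
}

// ReleaseOnDelete treats a missing lease as success so that deletion
// (and uninstallation) is not blocked, while any other error is returned
// to the caller instead of being swallowed.
func ReleaseOnDelete(s *LeaseStore, name, ownerUID string) error {
	err := s.Release(name, ownerUID)
	if errors.Is(err, ErrLeaseNotFound) {
		return nil
	}
	return err
}

func main() {
	s := NewLeaseStore()
	fmt.Println(s.Acquire("development", "uid-1"))          // first org claims the name
	fmt.Println(ReleaseOnDelete(s, "development", "uid-1")) // org deleted, lease released
	fmt.Println(ReleaseOnDelete(s, "development", "uid-1")) // lease already gone, still fine
	fmt.Println(s.Acquire("development", "uid-2"))          // a new org can reuse the name
}
```

In this sketch only the "not found" case is ignored. An orphaned lease like the one reported here would come from a release that failed for some other reason and went unnoticed, or from a release that never ran at all.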

Environment

Revision of codebase: 2286a62f2ebcb58e103d0a2efeed1c6d2beda0c4
Kubernetes version (use kubectl version): Unknown
Cloud provider or hardware configuration: Unknown
Others:

danail-branekov commented 2 years ago

Hmmm, the DuplicateValidator seems to be quite thread-safe, since it makes use of the optimistic locking idiom. However, for the validator to kick in, handle the org deletion and delete the lease, the korifi controllers must have been running while the CFOrg was being deleted. Can we confirm whether this was the case?