clastix / cluster-api-control-plane-provider-kamaji

The Kamaji Control Plane provider implementation of the Cluster Management API
Apache License 2.0
80 stars 29 forks source link

Fix reconcilation problem after restore KamajiControlPlane from backup #83

Closed andreykont closed 6 months ago

andreykont commented 8 months ago

My goal is restore managed cluster from backup. I use Velero for restore. First of all I recover etcd database, TenantControlPlane and other generic capi resources. On last stage I have to recover KamajiControlPlane resource.

KamajiControlPlane gets "Ready" status, but I see error in kamaji control plane controller's log:

2024-02-21T13:54:48Z ERROR Reconciler error {"controller": "kamajicontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "KamajiControlPlane", "KamajiControlPlane": {"name":"ocean-ocean-avkontya3-jvpinos","namespace":"ocean-ocean-avkontya3-jvpinos"}, "namespace": "ocean-ocean-avkontya3-jvpinos", "name": "ocean-ocean-avkontya3-jvpinos", "reconcileID": "2cece214-b52e-41fc-884a-cac3ebd28bd6", "error": "cannot create or update CA secret: missing Certificate value from *kamajiv1alpha1.TenantControlPlane CA", "errorVerbose": "missing Certificate value from *kamajiv1alpha1.TenantControlPlane CA\ngithub.com/clastix/cluster-api-control-plane-provider-kamaji/controllers.(*KamajiControlPlaneReconciler).createOrUpdateCertificateAuthor

Cause of problem: In my case kamajiCA and capiCA secret names are equals. Also when I restore kcp with Velero, it doesn't restore secret OwnerReferences. Kamaji control plane detects it like it's own object and delete "ca.crt" and "ca.key" data in kamajiCA secret. On the second iteration it makes error, because ca.crt deleted.

Wanted behavior: KamajiControlPlane controller doesn't control kamajiCA secrets.

How I solved the error: Our provider doesn't have to control KamajiCA secret. I replaced condition that check who is owner of CA secret. Now we control it by secret names.

prometherion commented 8 months ago

I'm a bit confused @andreykont, isn't this a problem in Kamaji upstream rather than on the CAPI provider?

andreykont commented 8 months ago

I'm a bit confused @andreykont, isn't this a problem in Kamaji upstream rather than on the CAPI provider?

Good day, @prometherion . Yes. I completely agree. Kamaji doesn't restore owner reference of it's CA secret. It's a mistake. But anyway it might take some period of time. I think capi provider has to control such situation (kamaji and capiCA secret names are equal) to avoid racing. In addition, I think Kamaji project knows nothing about CAPI project and don't have to control it's CA secret name (to be not equals with CAPI project secret name).

prometherion commented 8 months ago

Kamaji doesn't restore owner reference of it's CA secret

That's correct: https://github.com/clastix/kamaji/blob/eff68db336f3c161447e08a6733344260c10e12a/internal/resources/ca_certificate.go#L92-L110

This snippet shows we're just avoiding rewriting the secret content if the certificates are matching since Velero is not able to restore the Owner Reference.

But anyway it might take some period of time

It's Kubernetes, eventually consistent :) Jokes apart, we can wait for a few iterations before having everything reconciled properly, and I'd say Kamaji resources are our source of truth since the CAPI CP provider is almost dummy.

I think capi provider has to control such situation (kamaji and capiCA secret names are equal) to avoid racing. In addition, I think Kamaji project knows nothing about CAPI project and don't have to control it's CA secret name (to be not equals with CAPI project secret name)

This is a bit problematic since we started Kamaji completely decoupled from Cluster API, and the CP provider is the one responsible for this, however, we stumbled upon a naming collision afterwards and had to patch in this way.


tl;dr; WDYT in fixing this in Kamaji, rather than in CAPI?

andreykont commented 8 months ago

Hello, @prometherion.

Let me clarify the problem once more. After restoring kamaji tenant control plane’s secret with cert and key is restored, KamajiControlPlane Provider start reconcilation and adopts the secret changing keys in its map: https://github.com/clastix/cluster-api-control-plane-provider-kamaji/blob/346f3ca1de43661a2b44bee84c66f60f1d29e24c/controllers/kamajicontrolplane_controller_resources.go#L102-L105

Then it updates secret and returns to reconcile process and comes back to check secret once more. As ownerReference is set by KamajiControlPlaneProvider line https://github.com/clastix/cluster-api-control-plane-provider-kamaji/blob/346f3ca1de43661a2b44bee84c66f60f1d29e24c/controllers/kamajicontrolplane_controller_resources.go#L108

the secret is not valid for that check: https://github.com/clastix/cluster-api-control-plane-provider-kamaji/blob/346f3ca1de43661a2b44bee84c66f60f1d29e24c/controllers/kamajicontrolplane_controller_resources.go#L77

and KamajiControlPlaneProvider tries to find ca.crt and ca.key in secret, which it patched during the previous reconcilation loop. As the result it logs errors as it is not able to find in TenantControlPlane secret ca.crt,

another solution is to add more checks there https://github.com/clastix/cluster-api-control-plane-provider-kamaji/blob/346f3ca1de43661a2b44bee84c66f60f1d29e24c/controllers/kamajicontrolplane_controller_resources.go#L77

either by name equals or that secret is already patched for CAPI.

I think it should not be patched from the kamaji upstream as it is not aware of kamajiControlPlane provider. Moreover, i think it can be fixed in kamajiControlPlane provider.

If we come to CAPI Project to fix that it is going to break their agreement on how to prepare secrets i think. https://cluster-api.sigs.k8s.io/developer/architecture/controllers/cluster.html#secrets

prometherion commented 7 months ago

I think it should not be patched from the kamaji upstream as it is not aware of kamajiControlPlane provider

True the fact Kamaji is not aware of the CAPI CP provider, but Owner/Controller References should be fixed upon first reconciliation, that's the reason why I'm suggesting fixing in Kamaji.

Once we have the OwnerReference in place, there's not bug in in the CP provider, if I understand correctly.

andreykont commented 7 months ago

Hello, @prometherion. Ok, I will open PR in Kamaji project to restore owner reference during reconciliation process.

andreykont commented 6 months ago

Fixed in https://github.com/clastix/kamaji/pull/427