GoogleCloudPlatform / gke-managed-certs

Managed Certificates for Kubernetes clusters using GCLB
Apache License 2.0

Pre 0.4.2/GKE 1.16.8-gke.3: bug quickly re-creating a ManagedCertificate #45

Closed krzykwas closed 3 years ago

krzykwas commented 4 years ago

In versions before 0.4.2/GKE 1.16.8-gke.3, a bug in handling fast re-creation of a ManagedCertificate resource can occur with low frequency.

When a ManagedCertificate is deleted, a cleanup process starts that removes the accompanying GCP resource. If the certificate is re-created before the cleanup process has finished, it may become stuck in an invalid state.

Diagnosis:

  1. The certificate has status FailedNotVisible.
  2. In the internal state of the GKE Managed Certificates controller, the certificate is marked SoftDeleted: true. To check, run $ kubectl describe configmap managed-certificate-config -n kube-system
  3. The Ingress annotation managed-certificates does not include the certificate in this invalid state.
  4. The Ingress annotation pre-shared-cert should not include the certificate in this invalid state either. However, before GKE 1.16.0-gke.20, if it is the only certificate attached to the Ingress, it won't be detached because of a separate Ingress issue; in that case the pre-shared-cert annotation will still include the SslCertificate.
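
The four checks above can be gathered in one small helper. This is only a sketch: the function name `diagnose_certificate` and its arguments are hypothetical, and it assumes the certificate and Ingress live in the current namespace. It prints the raw output so you can look for FailedNotVisible, SoftDeleted: true, and the two annotations yourself.

```shell
# Sketch only: diagnose_certificate and its arguments are hypothetical names.
# Usage: diagnose_certificate <managed-certificate-name> <ingress-name>
diagnose_certificate() {
  cert="$1"
  ingress="$2"

  # Check 1: the status section should show FailedNotVisible.
  kubectl describe managedcertificate "$cert"

  # Check 2: the controller's internal state should show SoftDeleted: true.
  kubectl describe configmap managed-certificate-config -n kube-system

  # Checks 3 and 4: print all Ingress annotations and inspect the
  # managed-certificates and pre-shared-cert entries by hand.
  kubectl get ingress "$ingress" -o jsonpath='{.metadata.annotations}'
}
```

Printing every annotation avoids hard-coding the full annotation keys, which the description above does not spell out.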

Workaround:

Delete the ManagedCertificate and allow up to 2 minutes for the GKE Managed Certificates controller to finish the cleanup process.

Before GKE 1.16.0-gke.20, the cleanup process cannot succeed because Ingress does not release the last certificate. You have the following options:

To fix the faulty certificate:

  1. detach the ManagedCertificate resource from Ingress (remove from the managed-certificates annotation)
  2. delete the ManagedCertificate
  3. after 2 minutes re-create the ManagedCertificate and attach it to Ingress.
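
The three steps above might be scripted roughly as follows. Treat it as a sketch under assumptions: the function name, its arguments, and the manifest path are hypothetical; the full annotation key `networking.gke.io/managed-certificates` is an assumption (the text above only calls it managed-certificates) and can be overridden through `ANNOTATION_KEY`; and step 1 removes the whole annotation, so it only fits an Ingress with a single managed certificate.

```shell
# Assumed full annotation key; override if your cluster uses a different one.
ANNOTATION_KEY="${ANNOTATION_KEY:-networking.gke.io/managed-certificates}"

# Sketch only: recreate_certificate and its arguments are hypothetical names.
# Usage: recreate_certificate <ingress-name> <certificate-name> <manifest-file>
recreate_certificate() {
  ingress="$1"
  cert="$2"
  manifest="$3"

  # Step 1: detach the certificate by removing the annotation
  # (a trailing "-" tells kubectl annotate to delete the key).
  kubectl annotate ingress "$ingress" "${ANNOTATION_KEY}-" || return 1

  # Step 2: delete the ManagedCertificate.
  kubectl delete managedcertificate "$cert" || return 1

  # Step 3: give the controller 2 minutes to finish the cleanup,
  # then re-create the certificate and attach it again.
  sleep "${WAIT_SECONDS:-120}"
  kubectl apply -f "$manifest" || return 1
  kubectl annotate ingress "$ingress" "${ANNOTATION_KEY}=${cert}" --overwrite
}
```

If the Ingress carries several certificates, edit the annotation value by hand instead of removing it in step 1.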

Kezzsim commented 4 years ago

That's cool and all, but it doesn't seem to be a good or even a decent replacement for wildcards when it comes to making a domain-separated multitenant cluster. When I add a new subdomain using kubectl patch, it fails to patch the resource in place; instead it deletes the whole managed cert, creates a new one with the new domain added to it, and then attaches that cert to the load balancer... Which would be fine... outside of a production environment.

When it's doing that, requests to all sites that used that certificate go down which is quite honestly very bad. I'm on 1.16.8-gke.8 and still this is an issue. Are you saying that if I delete the cert and then wait two minutes and apply a new one with the new domain that I will experience less downtime?

krzykwas commented 4 years ago

@Kezzsim: this issue manifests in the cleanup phase, which cannot run to completion, i.e. the certificate becomes broken and cannot recover.

You describe a different problem: indeed the certificate right now cannot be updated in-place. Several approaches to addressing it have been considered, but none was pursued:

You can, however, create a new certificate and remove the old one once the new one starts working.
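
A rough sketch of that swap is below. Everything named here is an assumption for illustration: the function name and arguments are hypothetical, the annotation key `networking.gke.io/managed-certificates` and the `.status.certificateStatus` field are assumed, and "starts working" is interpreted as the status reaching Active. Both certificates stay attached to the Ingress until the new one is ready.

```shell
# Assumed full annotation key; override if your cluster uses a different one.
ANNOTATION_KEY="${ANNOTATION_KEY:-networking.gke.io/managed-certificates}"

# Sketch only: replace_certificate and its arguments are hypothetical names.
# Usage: replace_certificate <ingress> <old-cert> <new-cert> <new-manifest>
replace_certificate() {
  ingress="$1"
  old_cert="$2"
  new_cert="$3"
  new_manifest="$4"

  # Create the new certificate and attach it alongside the old one.
  kubectl apply -f "$new_manifest" || return 1
  kubectl annotate ingress "$ingress" \
    "${ANNOTATION_KEY}=${old_cert},${new_cert}" --overwrite || return 1

  # Poll until the new certificate reports Active (assumed status field).
  until [ "$(kubectl get managedcertificate "$new_cert" \
      -o jsonpath='{.status.certificateStatus}')" = "Active" ]; do
    sleep 30
  done

  # Only now detach and delete the old certificate.
  kubectl annotate ingress "$ingress" \
    "${ANNOTATION_KEY}=${new_cert}" --overwrite || return 1
  kubectl delete managedcertificate "$old_cert"
}
```

The loop has no timeout, so a certificate that never becomes Active keeps it polling; add a retry limit before using anything like this unattended.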