crossplane / upjet

A code generation framework and runtime for Crossplane providers
Apache License 2.0

Upon provider pod crash/restart, MR deletion leaves orphaned external resources on cloud provider #304

Open arfevrier opened 7 months ago

arfevrier commented 7 months ago

Resource deletion and orphaned resources on cloud provider

What happened?

Hello, I created a benchmark to perform tests with thousands of GCP objects. During the MR deletion part of my benchmark, sometimes (less than 1% of managed resources) the MR is deleted from the K8s cluster but the external resource is not deleted from GCP. The benchmark was made on the GCP provider, but I think the root cause is in the upjet framework generator. GCP provider version v0.37, upjet version v0.11.0-rc.0.

> Here is a link to my Run N°4: 4000 RecordSets: create, shutdown, then delete. On this run, I created 4000 RecordSet MRs, then waited for the MRs to reach the Ready state. I scaled the K8s cluster down to 0 worker nodes, then back to 1 worker node. After requesting deletion of all RecordSets, some of them are deleted from Kubernetes but not from the GCP console. Top diagram: number of API calls received by GCP for the DNS API endpoints. Bottom diagram: CPU usage of the worker node.

The CPU consumption of the node is 60%, meaning 10 vCPU of usage for the provider. During my other benchmarks the provider has never been able to consume more than 10 vCPU. This means that it is probably already consuming the maximum amount of available CPU time.

For example, the object recordset.dns.gcp.upbound.io/benchmarksix10 is missing the Terraform deletion log. This object is still present in the GCP console. The object recordset.dns.gcp.upbound.io/benchmarksix11 is correctly deleted and no longer present in the GCP console. Please find the complete log export (exported from GCP Logs Explorer):

For the object benchmarksix10, these two logs are missing:

2023-10-11T15:49:08Z DEBUG events Successfully requested deletion of external resource [...]
2023-10-11T15:49:08Z DEBUG provider-gcp Successfully requested deletion of external resource [...]

> Here is a link to my Run N°6: 2000 Buckets. The run deploys 2000 Buckets (at 1 rq/s), and then requests deletion of the buckets. Same behavior as Run N°4: buckets deleted from K8s but not from GCP.

For the object bucketbenchone1928, this log is present:

2023-10-20T17:30:44Z INFO provider-gcp Failure limit has been exceeded [...]

Expected behavior?

When requesting deletion of an object in K8s/Crossplane, the provider should wait for the deletion confirmation from GCP/Terraform. Here, the terraform destroy logic probably did not finish correctly, but the object was deleted anyway.

How can we reproduce it?

The project benchmark_gcp_18-10-2023 contains a description of how I built my Crossplane environment (installation, configuration and scripts):

ulucinar commented 7 months ago

Hi @arfevrier, Thank you for the detailed analysis & reports. I see the following messages in the logs related to the benchmarksix10 resource:

"2023-10-11T15:48:59Z\tDEBUG\tprovider-gcp\tSuccessfully deleted managed resource\t{\"controller\": \"managed/dns.gcp.upbound.io/v1beta1, kind=recordset\", \"request\": {\"name\":\"benchmarksix10\"}, \"uid\": \"0eb1ea8a-3814-4322-aa6e-bc3ce30d2868\", \"version\": \"6629296\", \"external-name\": \"projects/sbx-31371-7bnts707l94ejkf4dv7n/managedZones/telco-zone-dns/rrsets/benchmarksix10.telco.orange.com./A\", \"deletion-timestamp\": \"2023-10-11 15:48:58 +0000 UTC\"}"

"2023-10-11T15:51:44Z\tDEBUG\tprovider-gcp\tCannot get managed resource\t{\"controller\": \"managed/dns.gcp.upbound.io/v1beta1, kind=recordset\", \"request\": {\"name\":\"benchmarksix10\"}, \"error\": \"RecordSet.dns.gcp.upbound.io \\\"benchmarksix10\\\" not found\"}"

Do you have an idea why the provider could not find the external resource (the external resource in this context is the actual set of Google Cloud DNS records)? It looks like the managed resource (RecordSet.dns with name benchmarksix10) actually acquired a deletion timestamp, and when the provider observes the external resource and can no longer find it (e.g., a 404 is returned), it removes the finalizer on the managed resource (MR, a Kubernetes Custom Resource that adheres to the Crossplane resource model), which in turn removes the object from the etcd store, i.e., actually deletes the object marked for deletion.
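Roughly sketched (with hypothetical types and names, not the actual crossplane-runtime code), that deletion path looks like this:

// Hypothetical model of the path described above: an MR that already carries
// a deletion timestamp and whose external resource can no longer be observed
// is released for garbage collection.
package main

import "fmt"

type managedResource struct {
    DeletionTimestampSet bool     // set once deletion was requested, e.g. via kubectl delete
    Finalizers           []string // the provider's finalizer keeps the object in etcd
}

func releaseIfGone(mr *managedResource, externalResourceExists bool) {
    if mr.DeletionTimestampSet && !externalResourceExists {
        // Nothing left to clean up externally (e.g. the cloud API returned 404):
        // drop the finalizer; the API server then removes the object from etcd.
        mr.Finalizers = nil
    }
}

func main() {
    mr := &managedResource{DeletionTimestampSet: true, Finalizers: []string{"finalizer.managedresource.crossplane.io"}}
    // If Observe wrongly reports the external resource as gone, the finalizer
    // is dropped without any deletion call and the cloud resource is orphaned.
    releaseIfGone(mr, false)
    fmt.Println("remaining finalizers:", mr.Finalizers)
}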

One explanation for what happened to benchmarksix10 is the following: because, as you mentioned, the Successfully requested deletion of external resource logs are missing for this MR, it was marked for deletion (i.e., it acquired the deletion timestamp because, for example, you asked for the deletion of the MR with kubectl), and when the provider had a chance to reconcile the resource, it observed that the resource no longer exists (hence the \"RecordSet.dns.gcp.upbound.io \\\"benchmarksix10\\\" not found\" log message above); because the corresponding external resource is no longer found, the finalizer is removed, in which case the object is actually deleted.

But then the external resource must have been deleted by some entity other than the provider, and from what I understand this is not the case.

Another, more dangerous case is that the provider loses track of the external resource. The provider associates the MR with the external resource using the external-name annotation and a number of other pieces of information configured via upjet's resource configuration framework. In this specific case, for the RecordSet.dns resource, the external-name configuration is config.IdentifierFromProvider, which means all the information needed to relate the MR to the external resource is available in the external-name annotation (with the key crossplane.io/external-name). This external-name configuration is the most stable one, although it can result in unnormalized MR APIs (an API quality issue), but this could still be an external-name configuration issue.

The reasoning goes as follows: the external resource is provisioned successfully, the MR transitions to the Ready == True, Synced == True state, upon further reconciliations the external-name annotation's value is updated, and when the resource is observed with the updated external-name, it can no longer be found (and if it has acquired the deletion timestamp, it is ready for garbage collection and is garbage collected by Kubernetes). But such an external-name change would also affect non-deleted MRs of kind RecordSet.dns: non-deleted ones would also fail to find their corresponding external resources and thus attempt to reprovision them, causing a name conflict and resulting in Synced == False states with a reason like "a record set with the given name already exists". This is similar to what you observed in #305 with pod restarts. My understanding is that we are facing different symptoms here.
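For reference, that external-name configuration is selected in the provider's resource configuration, roughly like this (illustrative sketch, not the exact provider-gcp source; the package layout is hypothetical and the config import path varies by upjet version):

// Illustrative sketch of an upjet resource configuration selecting
// config.IdentifierFromProvider as the external-name strategy.
package dns

import "github.com/crossplane/upjet/pkg/config"

// Configure sets up the RecordSet resource configuration.
func Configure(p *config.Provider) {
    p.AddResourceConfigurator("google_dns_record_set", func(r *config.Resource) {
        // Everything needed to relate the MR to the external resource is kept
        // in the crossplane.io/external-name annotation, using the identifier
        // reported by the underlying Terraform provider.
        r.ExternalName = config.IdentifierFromProvider
    })
}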

Looking at the underlying Terraform provider's documentation, I see the following:

The provider treats this resource as an authoritative record set. This means existing records (including the default records) for the given type will be overwritten when you create this resource in Terraform. In addition, the Google Cloud DNS API requires NS records to be present at all times, so Terraform will not actually remove NS records during destroy but will report that it did.

I don't have experience with GCP DNS record sets, but could this maybe be related to our observations? At first sight, it does not seem related, at least for the benchmarksix10 resource, because what we have observed is the lack of the provider's attempt to make the deletion call (the missing successfully requested deletion of external resource log), but could it be related to the "although MRs are deleted, the external resources are still there" part of the discussion?

arfevrier commented 7 months ago

Hi @ulucinar, Thank you for your comment. I'm not sure I understand why you said the "provider could not find the external resource". There is no such information printed in the log. The end of the log for the benchmarksix10 resource looks more like this:

2023-10-11T15:48:59Z\tDEBUG\tprovider-gcp\tSuccessfully deleted managed resource\t
2023-10-11T15:51:44Z\tDEBUG\tprovider-gcp\tReconciling\t
2023-10-11T15:51:44Z\tDEBUG\tprovider-gcp\tCannot get managed resource\t

Successfully deleted managed resource: Here we are on line reconciler.go#L998 of the reconcile loop. This means the finalizer has been removed from the resource (on line 981). The resource has no more finalizers, and K8s will finally delete it. Because the resource has been modified, this starts a new reconcile loop. We see that new reconcile loop at 2023-10-11T15:51:44 Reconciling. Of course, the resource doesn't exist anymore (Cannot get managed resource) because the finalizer was removed in the previous reconcile loop.

That’s why I think we need to understand what happened before this Successfully deleted managed resource log. I've just seen that the benchmarksix10 and benchmarksix11 objects have no Terraform deletion log. Maybe it's linked to your comment: "Terraform will not actually remove NS records during destroy but will report that it did."

BUT! We don't have the "Successfully requested deletion of external resource" log. We should enter the if condition here reconciler.go#L942, but there is neither "Cannot delete external resource" nor "Successfully requested deletion of external resource", meaning that observation.ResourceExists is false. The external resource doesn't exist? How is this observation done? There is no log of a Terraform refresh: the provider never performed a Terraform refresh after the restart of the provider at 15:40 (CET).
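To make that reasoning explicit, here is a simplified model of the branch around reconciler.go#L942 (not the real reconciler code), showing which log line belongs to which path:

// Simplified model only: maps the reconciler's delete-branch log messages to
// the value of observation.ResourceExists.
package main

import "log"

func deletionBranch(resourceExists bool, deleteExternal func() error) {
    if resourceExists {
        if err := deleteExternal(); err != nil {
            log.Print("Cannot delete external resource")
            return
        }
        log.Print("Successfully requested deletion of external resource")
        return
    }
    // Neither of the two messages above can appear on this path, which is
    // exactly what the benchmarksix10 log shows: the reconciler must have seen
    // ResourceExists == false and went straight to finalizer removal.
    log.Print("Successfully deleted managed resource")
}

func main() {
    // The orphaning case: no external deletion call is ever attempted.
    deletionBranch(false, func() error { return nil })
}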

If we look at the Observe function external_nofork.go#L451, the only way to return ResourceExists=false without an error is this condition:

if meta.WasDeleted(mg) && n.opTracker.IsDeleted() {
    // The MR already has a deletion timestamp and the operation tracker has
    // marked it as deleted, so no Terraform refresh is attempted here.
    return managed.ExternalObservation{
        ResourceExists: false,
    }, nil
}

We can also take a look at the GCP Bucket issue. You can see in the log that the Bucket bucketbenchone1928 is deleted, but the audit log at the end shows that my browser can still access the bucket.

With this bucketbenchone1928 bucket we have the log "Successfully requested deletion of external resource" multiple times, but it comes with a Terraform TTL error:

2023-10-20T17:30:44Z\tDEBUG\tevents\tcannot schedule a native provider during observe: c1545cfc-393b-4e22-857e-ce8108abdbb3: cannot schedule native Terraform provider process, please consider increasing its TTL