Upjet leaks orphan external resources upon provider pod crash/restart

What problem are you facing?

Hello, I'm working on project requiring to manage in GCP thousands of objects. I wanted to test Crossplane's behaviour with this large number of objects created. In k8s, pods have no guarantee to run to completion and may be interrupted at any time, including graceful K8S node shutdown. I have observed that when the provider's pod is terminated while a crossplane MR reconciles and an external resource is being created, we have no information to be able to recover manually. This lack of information is problematic when you have to manage a large number of objects. I made a benchmark on the GCP provider benchmark_gcp_18-10-2023. The project contains installation and configuration code.

Details of the infrastructure:

Date: 18/10/2023
Platform: Google Cloud Platform (GCP), GKE Cluster Standard
Version: K8s: 1.27.3-gke.100, Crossplane: v1.13, provider-gcp-dns:v0.37.0
Setting: Default, with Crossplane ControllerConfig: --debug and --enable-management-policies
Compute Machine type: e2-highcpu-16 (16vCPU/16Go)

On the run I have called Run N°5: 4000 RecordSets: Service interruption during resource creation I have terminated the GCP provider pod during the deployment of 4000 RecordSets.

Top diagram: Number of log received from the provider. Botton diagram: Number of API calls reveiced by GCP for the DNS API endpoints.

After provider restart some resources are already created. As the provider was unable to complete the creation of the object in K8s, in the state of the K8s cluster the object does not exist. It will then try to start creating it again. The provider tries in loop to create the resources but fails because the name of the RecordSet already exist.

The provider tries to create in loop the resource recordset/benchmarkeight339 but Terraform fails in loop with 409 error.

Log details (exported from GCP Logs Explorer):

benchmarkeight339.csv, benchmarkeight339.json

In the case of GCP RecordSet, two external resources cannot have the same name. But if we use resources referenced by their ID, the resource will then be orphaned. By default no information about creation in progress is printed. In the GCP provider we need to start the pod with the --debug option.

How could Upjet help solve your problem?

The provider could give more information in order to reconciliate manually the situation. For exemple:

By default, the provider log could contain information about the creation status of an object. Without the need to rely on the --debug option.
The MR resource can block on Status.Conditions.Reason "Creating". If the provider start and see MR resource in "Creating" it mean something have not been done properly. The object is put in error to warn the human operator about potential orphan resource in GCP (e.g.: "The provider restarted during resource creation"). This is the way Terraform has chosen: The .tfstate is lock at startup. In case of a crash, the lock is not release. Terraform cannot be started up again without an operator manually releasing the lock.
I don't know if this is possible: Create an object on the cloud provider in a transactional way. In order to perform failure recovery. Undoubtedly complicated if we go by Terraform.

Hi @arfevrier, Thank you for reporting these issues.

I also agree your concerns about the lack of progress logs in the non-debug log level of the provider.

Regarding tracking potentially orphaned resources due to pod restarts via the status conditions, we already have such a mechanism implemented via the more stable metadata.annotations. In the context of Kubernetes API resources and controllers, it's advised not to make decisions by reading the status:

... status should be able to be reconstituted from the state of the world, so it’s generally not a good idea to read from the status of the root object.

So, instead of using the information we make available in the status and branching the flow of control in the controller based on the status information, we have previously chosen to do so using the crossplane.io/external-create-{pending,succeeded,failed} annotations. These annotations are considered as critical annotations and clients with retry capabilities are used when there's a need to update the values of these annotations.

Having provided some previous context on the issue, I believe there's room for improvement here. Upjet has what we call as an asynchronous mode of reconciliation and the above annotation-based consistency mechanisms do not work as intended as they assume when the flow of control returns back to what we call as the managed reconciler during the reconciliation of an MR, the result of the creation operation is known. In case of this async reconciliation, the MR reconciler (in this case upjet runtime) just reports success to the managed reconciler, which in turn annotates the MR with the external-create-succeeded annotation, whereas the actual external create call is probably ongoing in a separate Go routine. Although this requires some further investigation, I think this is what you are observing because the guards in place to prevent the leaks in case of pod restarts become ineffective. Of course, as you've also mentioned above, if we could put the actions of reading & updating the Kubernetes API resources and reading & updating the Cloud API resources in the boundary of the same distributed transaction, we could guarantee consistency. But these are not transactional entities and to the best of my knowledge, we need to live with the eventual consistency semantics and at times, the need for some manual interventions to establish consistency. As I mentioned, this needs further investigation but I believe there's room for improvement here.

many thanks @ulucinar for the prompt and detailed analysis !

We were wondering how much can be reused from terraform experience in recovering from failures.

https://stackoverflow.com/a/69652001 mentions terraform relies on the backend to lock the tf state and fail fast on crash during this lock.

Is upjet leveraging the local backend which locks that state using system APIs ? AFAIK, Linux system file lock is not persistent across process restart (e.g. see https://unix.stackexchange.com/a/594041

Is there a way to configure an alternative terraform backend that would support remote locking (and therefore decoupled from crossplane provider pod restart) ?

Hey folks, any updates here ?

crossplane / upjet

Upjet leaks orphan external resources upon provider pod crash/restart #305

What problem are you facing?

How could Upjet help solve your problem?