**ulucinar** opened 2 years ago
Severity: low, as we currently don't encounter this often. Please add your use case if you encounter it.
This also happens with the GCP provider if you enable a `ProjectService` and then attempt to create a resource that uses the newly enabled API. Enabling an API on a project takes a while to fully propagate, and during that time the API will return "API not enabled" errors to random calls. This means it's possible to get as far as actually creating the GCP resource, but then be unlucky and fail one of the later calls in the provider, leading to the tainted state. The more calls that are needed to set up the resource, the more likely it is that one of them will fail. This seems particularly noticeable when enabling the DNS `ProjectService` combined with creating a `ManagedZone` and a `RecordSet`; in my tests the `RecordSet` ends up tainted roughly 20% of the time.
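For illustration, here is a minimal sketch of that GCP scenario, assuming Upbound provider-gcp style kinds (`ProjectService`, `ManagedZone`, `RecordSet`); the API groups, versions, and spec fields below are assumptions and may not match the provider exactly:

```sh
# Hypothetical sketch: enable the DNS API and immediately create DNS resources
# that depend on it. API groups/versions and forProvider fields are assumptions
# based on provider-gcp conventions, not verified against the provider's CRDs.
kubectl apply -f - <<'EOF'
apiVersion: cloudplatform.gcp.upbound.io/v1beta1
kind: ProjectService
metadata:
  name: enable-dns
spec:
  forProvider:
    service: dns.googleapis.com
---
apiVersion: dns.gcp.upbound.io/v1beta1
kind: ManagedZone
metadata:
  name: example-zone
spec:
  forProvider:
    dnsName: example.com.
---
apiVersion: dns.gcp.upbound.io/v1beta1
kind: RecordSet
metadata:
  name: example-record
spec:
  forProvider:
    managedZoneRef:
      name: example-zone
    name: www.example.com.
    type: A
    ttl: 300
    rrdatas:
      - 10.0.0.1
EOF
```

Because the `RecordSet` requires several provider calls after the API is enabled, any one of them can hit the propagation window and taint the resource.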
@JonathanO is correct in saying that the more calls needed to set up the resource, the more likely it is to end up in this tainted state. The documentation for the `terraform untaint` command mentions:

> Terraform automatically marks an object as "tainted" if an error occurs during a multi-step "create" action, because Terraform can't be sure that the object was left in a fully-functional state.
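For reference, clearing a taint in plain Terraform is a single CLI call against the resource address (the address below is hypothetical):

```sh
# Find the tainted object, then clear the taint so the next plan updates it
# in place instead of replacing it. The resource address is hypothetical.
terraform state list
terraform untaint aws_rds_cluster_instance.example
```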
We plan to address this issue upstream in the Crossplane managed reconciler to better accommodate the async operations we use in Upjet. This will allow us to ensure the reconciler applies the `crossplane.io/external-create-failed` annotation.
The TL;DR is that while we won't be able to completely stop resources from getting into a tainted state, we will make it more obvious when this happens. The resource will then follow the usual Crossplane remediation approach for resolving it: manually confirming the external resource state, and then removing the annotation on the resource to allow reconciliation to continue.
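As a sketch of that remediation flow (the resource name is hypothetical, and the exact set of annotations to remove should be confirmed against the Crossplane docs), the annotations can be cleared with `kubectl annotate` once you have manually verified the external resource:

```sh
# Hypothetical resource name. After manually confirming the external (cloud)
# resource is in the expected state, remove the create-failure annotations
# (trailing "-" deletes an annotation) so reconciliation can resume.
kubectl annotate clusterinstance.rds.aws.upbound.io example-instance \
  crossplane.io/external-create-failed- \
  crossplane.io/external-create-pending-
```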
Making improvements here is currently on our Q1 roadmap.
### What happened?
We observed stability issues for the `ClusterInstance.rds` and `ClusterRoleAssociation.rds` resources while doing tests in the context of https://github.com/upbound/official-providers/pull/590. The issue for both of them was that after they acquire `Ready` and `Sync` conditions with `True` status, we may experience `ReconcileError`s and the `Sync` condition becomes `False`.

For resources whose provisioning is done with multiple calls to AWS, if any of these steps fails, the resource's state is marked as tainted and Terraform will attempt to replace the resource. This fails because of the `prevent_destroy` lifecycle meta-argument we employ for `Terraformed` resources.

The theory is that when a related resource (`rds.ClusterActivityStream`) is provisioned simultaneously with the `ClusterInstance.rds` and `ClusterRoleAssociation.rds` resources, it affects the state of the associated `ClusterInstance` resource, which in turn prevents the successful provisioning of the dependent `ClusterInstance.rds` and `ClusterRoleAssociation.rds` resources. This theory has been tested once by first separately provisioning all resources except the `rds.ClusterActivityStream` and then provisioning the remaining `rds.ClusterActivityStream`, which succeeded (a sketch of this two-phase test follows the list below).

Here are the relevant manifests:
Known affected resources:

- `ClusterInstance.rds`
- `ClusterRoleAssociation.rds`
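A minimal sketch of the two-phase test described above, assuming the example manifests live under `examples/rds`; all file names below are hypothetical:

```sh
# Phase 1: provision everything except the ClusterActivityStream and wait
# for the cluster instances to become Ready (file names are hypothetical).
kubectl apply -f examples/rds/cluster.yaml \
  -f examples/rds/clusterinstance.yaml \
  -f examples/rds/clusterroleassociation.yaml
kubectl wait --for=condition=Ready --timeout=30m \
  clusterinstance.rds.aws.upbound.io --all

# Phase 2: only then provision the ClusterActivityStream.
kubectl apply -f examples/rds/clusteractivitystream.yaml
```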
### How can we reproduce it?
`kubectl apply -f examples/rds`, excluding `proxy*` resources.
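One way to script that exclusion (a sketch; assumes flat YAML files directly under `examples/rds`):

```sh
# Apply every manifest under examples/rds except the proxy* ones.
for f in examples/rds/*.yaml; do
  case "$(basename "$f")" in
    proxy*) continue ;;   # skip proxy resources as noted above
  esac
  kubectl apply -f "$f"
done
```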