Upjet cannot handle tainted Terraform state

ulucinar commented 2 years ago

What happened?

We observed stability issues for resources ClusterInstance.rds and ClusterRoleAssociation.rds while doing tests in the context of https://github.com/upbound/official-providers/pull/590. The issue for both of them was that after they acquire the Ready and Synced conditions with True status, we may experience ReconcileErrors, and the Synced condition becomes False.

For resources whose provisioning requires multiple calls to AWS, if any of these steps fails, the resource's state is marked as tainted and Terraform will attempt to replace the resource. This fails because of the prevent_destroy lifecycle meta-argument we employ for Terraformed resources.
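As a rough illustration of what "tainted" looks like on disk, the sketch below scans a Terraform state document for instances whose status is "tainted". It assumes the version 4 state layout (a top-level `resources` list whose `instances` entries carry an optional `status` field), which is not a stable public interface. The sample state is made up for the example and is not taken from the failing workspace.

```python
import json


def tainted_addresses(state_json: str) -> list[str]:
    """Return "type.name" for every resource with a tainted instance.

    Assumes the Terraform state v4 layout; other versions may differ.
    """
    state = json.loads(state_json)
    found = []
    for res in state.get("resources", []):
        for inst in res.get("instances", []):
            # Healthy instances normally omit "status" entirely.
            if inst.get("status") == "tainted":
                found.append(f'{res["type"]}.{res["name"]}')
                break  # one tainted instance is enough to report the resource
    return found


# Hypothetical, heavily trimmed state used only for this example.
SAMPLE_STATE = json.dumps({
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_rds_cluster_instance",
            "name": "example",
            "instances": [{"status": "tainted", "attributes": {"id": "example"}}],
        },
        {
            "mode": "managed",
            "type": "aws_rds_cluster",
            "name": "example",
            "instances": [{"attributes": {"id": "example"}}],
        },
    ],
})

if __name__ == "__main__":
    print(tainted_addresses(SAMPLE_STATE))
```

A tainted instance is what makes the next plan propose a destroy-and-recreate, which prevent_destroy then rejects.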

The theory is that when a related resource (rds.ClusterActivityStream) is provisioned simultaneously with the ClusterInstance.rds and ClusterRoleAssociation.rds resources, it affects the state of the associated ClusterInstance resource, which in turn prevents the successful provisioning of the dependent ClusterInstance.rds and ClusterRoleAssociation.rds resources.

This theory has been tested once by first separately provisioning all resources except the rds.ClusterActivityStream and then provisioning the remaining rds.ClusterActivityStream, which succeeded.

Here are the relevant manifests:

apiVersion: rds.aws.upbound.io/v1beta1
kind: ClusterInstance
metadata:
  annotations:
    crossplane.io/external-create-pending: "2022-08-25T04:22:23+03:00"
    crossplane.io/external-create-succeeded: "2022-08-25T04:22:23+03:00"
    crossplane.io/external-name: example
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rds.aws.upbound.io/v1beta1","kind":"ClusterInstance","metadata":{"annotations":{"upjet.upbound.io/manual-intervention":"This resource has a reference to Cluster, which needs manual intervention."},"name":"example"},"spec":{"forProvider":{"clusterIdentifierRef":{"name":"example"},"engine":"aurora-postgresql","identifier":"example","instanceClass":"db.r5.large","region":"us-west-1"}}}
    upjet.crossplane.io/provider-meta: '{"e2bfb730-ecaa-11e6-8f88-34363bc7c4c0":{"create":5400000000000,"delete":5400000000000,"update":5400000000000}}'
    upjet.upbound.io/manual-intervention: This resource has a reference to Cluster,
      which needs manual intervention.
  creationTimestamp: "2022-08-25T01:20:20Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 4
  name: example
  resourceVersion: "117495"
  uid: 9ea16b29-0e0c-4e94-9b9d-8b3a66f618f2
spec:
  deletionPolicy: Delete
  forProvider:
    autoMinorVersionUpgrade: true
    availabilityZone: us-west-1c
    caCertIdentifier: rds-ca-2019
    clusterIdentifier: example
    clusterIdentifierRef:
      name: example
    dbParameterGroupName: default.aurora-postgresql13
    dbSubnetGroupName: default
    engine: aurora-postgresql
    engineVersion: "13.7"
    identifier: example
    instanceClass: db.r5.large
    preferredBackupWindow: 11:46-12:16
    preferredMaintenanceWindow: fri:06:31-fri:07:01
    region: us-west-1
    tags:
      crossplane-kind: clusterinstance.rds.aws.upbound.io
      crossplane-name: example
      crossplane-providerconfig: default
  providerConfigRef:
    name: default
status:
  atProvider:
    arn: arn:aws:rds:us-west-1:609897127049:db:example
    dbiResourceId: db-QR2KJLB6F5VUUAU7ZJOPYJ3SOE
    endpoint: example.c4asmkmn5yqi.us-west-1.rds.amazonaws.com
    engineVersionActual: "13.7"
    id: example
    kmsKeyId: ""
    port: 5432
    storageEncrypted: false
    tagsAll:
      crossplane-kind: clusterinstance.rds.aws.upbound.io
      crossplane-name: example
      crossplane-providerconfig: default
    writer: true
  conditions:
  - lastTransitionTime: "2022-08-25T01:27:49Z"
    message: 'observe failed: cannot run plan: plan failed: Instance cannot be destroyed:
      Resource aws_rds_cluster_instance.example has lifecycle.prevent_destroy set,
      but the plan calls for this resource to be destroyed. To avoid this error and
      continue with the plan, either disable lifecycle.prevent_destroy or reduce the
      scope of the plan using the -target flag.: File name: main.tf.json'
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2022-08-25T01:27:38Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-08-25T01:27:27Z"
    reason: Finished
    status: "True"
    type: AsyncOperation
  - lastTransitionTime: "2022-08-25T01:27:27Z"
    message: 'apply failed: unexpected state ''configuring-activity-stream'', wanted
      target ''available''. last error: %!s(<nil>): : File name: main.tf.json'
    reason: ApplyFailure
    status: "False"
    type: LastAsyncOperation

apiVersion: rds.aws.upbound.io/v1beta1
kind: ClusterRoleAssociation
metadata:
  annotations:
    crossplane.io/external-create-pending: "2022-08-25T04:22:26+03:00"
    crossplane.io/external-create-succeeded: "2022-08-25T04:22:27+03:00"
    crossplane.io/external-name: example,arn:aws:iam::609897127049:role/sample-db-role
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rds.aws.upbound.io/v1beta1","kind":"ClusterRoleAssociation","metadata":{"annotations":{"upjet.upbound.io/manual-intervention":"This resource has a reference to Cluster, which needs manual intervention."},"name":"example"},"spec":{"forProvider":{"dbClusterIdentifierRef":{"name":"example"},"featureName":"s3Import","region":"us-west-1","roleArnRef":{"name":"sample-db-role"}}}}
    upjet.crossplane.io/provider-meta: "null"
    upjet.upbound.io/manual-intervention: This resource has a reference to Cluster,
      which needs manual intervention.
  creationTimestamp: "2022-08-25T01:20:21Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 2
  name: example
  resourceVersion: "117501"
  uid: 04a590f6-f51f-4013-be32-3af799117fea
spec:
  deletionPolicy: Delete
  forProvider:
    dbClusterIdentifier: example
    dbClusterIdentifierRef:
      name: example
    featureName: s3Import
    region: us-west-1
    roleArn: arn:aws:iam::609897127049:role/sample-db-role
    roleArnRef:
      name: sample-db-role
  providerConfigRef:
    name: default
status:
  atProvider:
    id: example,arn:aws:iam::609897127049:role/sample-db-role
  conditions:
  - lastTransitionTime: "2022-08-25T01:27:52Z"
    message: 'observe failed: cannot run plan: plan failed: Instance cannot be destroyed:
      Resource aws_rds_cluster_role_association.example has lifecycle.prevent_destroy
      set, but the plan calls for this resource to be destroyed. To avoid this error
      and continue with the plan, either disable lifecycle.prevent_destroy or reduce
      the scope of the plan using the -target flag.: File name: main.tf.json'
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2022-08-25T01:27:45Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-08-25T01:27:33Z"
    reason: Finished
    status: "True"
    type: AsyncOperation
  - lastTransitionTime: "2022-08-25T01:27:33Z"
    message: 'apply failed: error waiting for RDS DB Cluster (example) IAM Role (arn:aws:iam::609897127049:role/sample-db-role)
      Association to create: timeout while waiting for state to become ''ACTIVE''
      (last state: ''PENDING'', timeout: 5m0s): : File name: main.tf.json'
    reason: ApplyFailure
    status: "False"
    type: LastAsyncOperation
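
Both manifests above show the same signature: Synced is False and the message mentions lifecycle.prevent_destroy. A minimal sketch of a check for that signature follows. It works on the `status.conditions` list already loaded as Python dictionaries; the function name is hypothetical, and matching on message text is a heuristic, not an Upjet API.

```python
def looks_tainted(conditions: list[dict]) -> bool:
    """Heuristic: Synced=False with a prevent_destroy plan failure."""
    for cond in conditions:
        if (
            cond.get("type") == "Synced"
            and cond.get("status") == "False"
            and "lifecycle.prevent_destroy" in cond.get("message", "")
        ):
            return True
    return False


# Conditions abbreviated from the ClusterInstance manifest above.
SAMPLE_CONDITIONS = [
    {
        "type": "Synced",
        "status": "False",
        "reason": "ReconcileError",
        "message": "observe failed: cannot run plan: plan failed: Instance "
                   "cannot be destroyed: Resource aws_rds_cluster_instance.example "
                   "has lifecycle.prevent_destroy set, but the plan calls for "
                   "this resource to be destroyed.",
    },
    {"type": "Ready", "status": "True", "reason": "Available"},
]

if __name__ == "__main__":
    print(looks_tainted(SAMPLE_CONDITIONS))
```

Note that Ready can stay True while this check fires, which is exactly the state both resources above are in.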

Known affected resources:

aws_route53_hosted_zone_dnssec

How can we reproduce it?

kubectl apply -f examples/rds, excluding proxy* resources.

luebken commented 1 year ago

Severity: low, as we currently don't encounter this often. Please add your use case if you encounter this.

JonathanO commented 10 months ago

This also happens with the GCP provider if you enable a ProjectService and then attempt to create a resource that uses the newly enabled API. Enabling an API on a project takes a while to fully propagate, and during that time the API will return "API not enabled" errors to random calls. This means it's possible to get as far as actually creating the GCP resource, but then be unlucky and fail one of the later calls in the provider, leading to the tainted state. The more calls that are needed to set up the resource, the more likely it is that one of them will fail. This seems particularly noticeable when enabling the DNS ProjectService, combined with creating a ManagedZone and a RecordSet. In my tests the RecordSet ends up tainted around 20% of the time.
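
To make the "more calls, more risk" point concrete, here is a small sketch of the arithmetic. It assumes each follow-up call fails independently with the same probability, which is a simplification, and the numbers used are illustrative rather than measured.

```python
def taint_probability(p_call_failure: float, calls: int) -> float:
    """Probability that at least one of `calls` follow-up calls fails.

    Assumes independent calls with an identical failure probability.
    """
    return 1.0 - (1.0 - p_call_failure) ** calls


if __name__ == "__main__":
    # Illustrative only: a 5% per-call failure rate during API propagation.
    for n in (1, 2, 4, 8):
        print(n, round(taint_probability(0.05, n), 3))
```

Under these assumptions, a 5% per-call failure rate across four follow-up calls already gives roughly a 19% chance of a tainted resource.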

jeanduplessis commented 10 months ago

@JonathanO is correct in saying the more calls needed to set up the resource, the more likely it is to end up in this tainted state. The TF untaint command mentions:

Terraform automatically marks an object as "tainted" if an error occurs during a multi-step "create" action, because Terraform can't be sure that the object was left in a fully-functional state.

We plan to address this issue upstream in the Crossplane Managed Reconciler to better accommodate the async operation we use in Upjet. This will allow us to ensure the reconciler applies the crossplane.io/external-create-failed annotation.

The TL;DR is that while we won't be able to completely stop resources from getting into a tainted state, we will make it more obvious when this happens. It will then follow the usual Crossplane remediation approach: manually confirm the external resource state, then remove the annotation on the resource to allow reconciliation to continue.
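
The remediation flow can be modeled as a toy function over a resource's annotations. This is only a sketch of the manual steps described here, not Crossplane's reconciler logic: the function name and the `external_state_confirmed` flag are hypothetical, and the timestamp in the sample is invented.

```python
CREATE_FAILED = "crossplane.io/external-create-failed"


def clear_create_failed(annotations: dict, external_state_confirmed: bool) -> dict:
    """Return a copy of the annotations without the create-failed marker.

    Refuses to proceed unless the operator has confirmed the external
    resource state first, mirroring the manual remediation order.
    """
    if not external_state_confirmed:
        raise ValueError(
            "confirm the external resource state before removing the annotation"
        )
    return {k: v for k, v in annotations.items() if k != CREATE_FAILED}


# Hypothetical annotations for illustration.
SAMPLE_ANNOTATIONS = {
    "crossplane.io/external-name": "example",
    CREATE_FAILED: "2022-08-25T04:27:27+03:00",
}

if __name__ == "__main__":
    print(clear_create_failed(SAMPLE_ANNOTATIONS, external_state_confirmed=True))
```

The guard exists because removing the annotation without checking the external resource first could let reconciliation continue against a half-created object.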

Making improvements here is currently on our Q1 roadmap.

crossplane / upjet

Upjet cannot handle tainted Terraform state #80
