crossplane / upjet

A code generation framework and runtime for Crossplane providers
Apache License 2.0
319 stars 89 forks source link

Deleting an MR with a pending RequiresNew change causes either a panic or recreation #420

Open mbbush opened 4 months ago

mbbush commented 4 months ago

What happened?

First reported in https://github.com/crossplane-contrib/provider-upjet-aws/issues/1360, but I've observed it with other MR kinds as well.

In the Delete method for a tf sdk managed resource, upjet sets n.instanceDiff.Destroy = true, then invokes

github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0045795e0,
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:909 

That Apply method is designed to handle both a "delete only" request, and a delete and recreate request. The delete implementation contains this code

        // If we're only destroying, and not creating, then return
        // now since we're done!
        if !d.RequiresNew() {
            return nil, diags
        }

which in this case doesn't trigger if the observed instance diff has RequiresNew = true due to a change to an immutable-by-terraform field, or due to a perpetual-diff bug that creates such a change, like https://github.com/crossplane-contrib/provider-upjet-aws/issues/1363. So then the function proceeds to attempt to recreate the desired state. In some cases, this successfully deletes and then recreates the resource, and then the next run of the reconciliation loop deletes it again. In other cases, such as in the linked issue, it causes a panic due to some unexpected input.

How can we reproduce it?

  1. Apply the example manifest for MultiRegionAccessPoint.s3control.aws.upbound.io/v1beta1
  2. Once the resource has been created and is in a loop of being unable to update a parameter which requires recreation, kubectl delete the managed resource.
  3. The provider pod will delete the external resource from aws, panic, restart, and then successfully delete the managed resource on the second try (because it will see that the resource is gone from aws)