[Bug]: provider-aws-s3control:v1.6.0. Pod restarts when MultiRegionAccessPoint deletion is requested

oliabent commented 2 weeks ago

Is there an existing issue for this?

[X] I have searched the existing issues

Affected Resource(s)

xpkg.upbound.io/upbound/provider-aws-s3control:v1.6.0

Resource MRs required to reproduce the bug

apiVersion: pkg.crossplane.io/v1 kind: Provider metadata: name: provider-aws-s3control spec: package: xpkg.upbound.io/upbound/provider-aws-s3control:v1.6.0

apiVersion: s3control.aws.upbound.io/v1beta1 kind: MultiRegionAccessPoint metadata: name: {{ $multiRegionAccessPointName }} annotations: gotemplating.fn.crossplane.io/composition-resource-name: multiregionaccesspoint spec: forProvider: details:

name: {{ $multiRegionAccessPointName }} region: {{- range $bucket := $multiRegionAccessPointBucketNames }}
- bucketRef: name: {{ $bucket }} {{- end }} region: {{ $multiRegionAccessPointRegion }} providerConfigRef: name: {{ $providerConfigName }}

Steps to Reproduce

request deletion of resource: kubectl delete...

What happened?

pod crashes and restarts

Relevant Error Output Snippet

xpkg.upbound.io/upbound/provider-aws-s3control:v1.6.0

When request for deletion MultiRegionAccessPoint is sent we receive logs:

`panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1204e132]

goroutine 11146 [running]:
github.com/hashicorp/terraform-provider-aws/internal/service/s3control.resourceMultiRegionAccessPointCreate({0x195b9b18, 0xc009170300}, 0xc003752400, {0x167944c0, 0xc006528820})
    github.com/hashicorp/terraform-provider-aws@v0.0.0-00010101000000-000000000000/internal/service/s3control/multi_region_access_point.go:168 +0x212
github.com/hashicorp/terraform-provider-aws/internal/provider.New.(*wrappedResource).Create.interceptedHandler[...].func8(0xc003752400?, {0x167944c0?, 0xc006528820})
    github.com/hashicorp/terraform-provider-aws@v0.0.0-00010101000000-000000000000/internal/provider/intercept.go:113 +0x283
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).create(0x195b9bc0?, {0x195b9bc0?, 0xc000b28540?}, 0xd?, {0x167944c0?, 0xc006528820?})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:773 +0x7a
github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0045795e0, {0x195b9bc0, 0xc000b28540}, 0xc0025852b0, 0xc00a512780, {0x167944c0, 0xc006528820})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:909 +0xa89
github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKExternal).Delete(0xc00bf28540, {0x195b9bc0, 0xc000b28540}, {0xc00c4af560?, 0xc0097e6cc0?})
    github.com/crossplane/upjet@v1.4.1/pkg/controller/external_tfpluginsdk.go:714 +0x146
github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Delete.func1()
    github.com/crossplane/upjet@v1.4.1/pkg/controller/external_async_tfpluginsdk.go:201 +0x145
created by github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKAsyncExternal).Delete in goroutine 4567
    github.com/crossplane/upjet@v1.4.1/pkg/controller/external_async_tfpluginsdk.go:197 +0x22c`

after that pod is restarted

Crossplane Version

v1.14.3

Provider Version

v1.6.0

Kubernetes Version

v1.29.4

Kubernetes Distribution

EKS

Additional Info

No response

mbbush commented 1 week ago

I can reproduce the issue running uptest. It requires manually kubectl deleteing the MultiRegionAccessPoint because it otherwise gets stuck due to #1363. We should probably handle both issues together, as they seem likely to have related solutions.

For each managed resource I delete, the provider pod restarts once, then successfully deletes the external resource, resolves the finalizer, and completes deletion of the managed resource. This makes me think that the reason for the panic is somehow related to the erroneous reconcile loop described in #1363, and that when the provider restarts with the managed resource already having a deletion timestamp (but not yet completely deleted), it doesn't trigger that reconcile loop, and instead successfully deletes the external resource.

mbbush commented 1 week ago

Looking at the implementation of the lines of code mentioned in the stacktrace, I've figured out what's happening.

in

github.com/crossplane/upjet/pkg/controller.(*terraformPluginSDKExternal).Delete(0xc00bf28540, {0x195b9bc0, 0xc000b28540}, {0xc00c4af560?, 0xc0097e6cc0?})
    github.com/crossplane/upjet@v1.4.1/pkg/controller/external_tfpluginsdk.go:714 +0x146

upjet sets n.instanceDiff.Destroy = true, then invokes

github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.(*Resource).Apply(0xc0045795e0, {0x195b9bc0, 0xc000b28540}, 0xc0025852b0, 0xc00a512780, {0x167944c0, 0xc006528820})
    github.com/hashicorp/terraform-plugin-sdk/v2@v2.33.0/helper/schema/resource.go:909 +0xa89

That Apply method is designed to handle both a "delete only" request, and a delete and recreate request. The delete implementation contains this code

        // If we're only destroying, and not creating, then return
        // now since we're done!
        if !d.RequiresNew() {
            return nil, diags
        }

which in this case doesn't trigger, because the observed instance diff has RequiresNew = true due to #1363. So then the function proceeds to attempt to recreate the desired state, which is (presumably; I didn't verify this) invalid and causes a panic due to some unexpected input. This also explains why there's a call to Create inside a Delete method.

I think we should update upjet to explicitly set RequiresNew to false when deleting sdk resources, as this could otherwise produce unexpected results even if it doesn't cause a panic.

crossplane-contrib / provider-upjet-aws