[Bug]: RDS Instance creation, resource already exists

mjnovice commented 3 weeks ago

Is there an existing issue for this?

[X] I have searched the existing issues

Affected Resource(s)

No response

Resource MRs required to reproduce the bug

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  annotations:
    crossplane.io/external-create-pending: "2024-06-06T17:43:44Z"
    crossplane.io/external-create-succeeded: "2024-06-06T17:43:44Z"
  creationTimestamp: "2024-06-06T15:55:19Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: ci-aseks6062802
  generation: 3
  labels:
    clusterId: ci-aseks6062802
  name: ci-aseks6062802qrvcs
  resourceVersion: "67343"
  uid: 48919564-24f7-43fb-b1ba-9dbbd8a9b1fa
spec:
  deletionPolicy: Delete
  forProvider:
    allocatedStorage: 256
    applyImmediately: true
    autoGeneratePassword: true
    backupRetentionPeriod: 3
    copyTagsToSnapshot: true
    dbSubnetGroupName: ci-aseks6062802k28ws
    dbSubnetGroupNameRef:
      name: ci-aseks6062802k28ws
    dbSubnetGroupNameSelector:
      matchLabels:
        clusterId: ci-aseks6062802
    engine: sqlserver-se
    engineVersion: 15.00.4236.7.v1
    identifier: ci-aseks6062802
    instanceClass: db.m5.xlarge
    licenseModel: license-included
    maxAllocatedStorage: 1000
    multiAz: false
    passwordSecretRef:
      key: password
      name: ci-aseks6062802
      namespace: sfd-ci-aseks6062802
    publiclyAccessible: true
    region: us-east-1
    skipFinalSnapshot: true
    storageEncrypted: true
    storageType: gp2
    username: testadmin
    vpcSecurityGroupIdRefs:
    - name: ci-aseks6062802-redis-mssql-sg2qb9w
    vpcSecurityGroupIdSelector:
      matchLabels:
        clusterId: ci-aseks6062802
        crossplane.io/forResource: redis-mssql
    vpcSecurityGroupIds:
    - sg-0ee65bd8753e2fdb8
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: service-fabric
  writeConnectionSecretToRef:
    name: mssql-creds
    namespace: sfd-ci-aseks6062802

Steps to Reproduce

Created an RDS Instance on a kind cluster running crossplane.
It gets stuck sometimes with a resource already exists error, even though initially there was no such instance.

What happened?

A single RDS Instance should have been created.

Relevant Error Output Snippet

- lastTransitionTime: "2024-06-06T17:44:46Z"
    message: "async create failed: failed to create the resource: [{0 creating RDS
      DB Instance (ci-aseks6062802): DBInstanceAlreadyExists: DB instance already
      exists\n\tstatus code: 400, request id: 1be9b4f8-31de-42e5-89f5-99cdd8885597
      \ []}]"



### Crossplane Version

1.16.1

### Provider Version

1.2.0

### Kubernetes Version

1.29

### Kubernetes Distribution

Kind

### Additional Info

_No response_

turkenf commented 3 weeks ago

@mjnovice, could you please try changing the field identifier: ci-aseks6062802?

mjnovice commented 3 weeks ago

@turkenf that helps, but that also creates another RDS no ? Is there a known cause for this issue ?

turkenf commented 3 weeks ago

I could not reproduce the bug, when you get this error, are you sure that there is no other instance with the identifier ci-aseks6062802?

mjnovice commented 3 weeks ago

@turkenf nope. There was none before applying the CR. This happens intermittently for me.

PatTheSilent commented 1 week ago

I've got a similar problem. I'm trying to create an RDS instance from another instance using restoreToPointInTime

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  annotations:
    crossplane.io/external-create-failed: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-pending: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-succeeded: '2024-06-19T09:18:22Z'
  creationTimestamp: '2024-06-19T09:18:21Z'
  finalizers:
    - finalizer.managedresource.crossplane.io
  generation: 2
  name: obfuscated-c75kw-mssql-db
  resourceVersion: '1261333276'
  uid: e04aadb8-ed95-4ec9-b09e-940f6bc4ade2
  selfLink: >-
    /apis/rds.aws.upbound.io/v1beta2/instances/obfuscated-c75kw-mssql-db
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-06-19T09:18:22Z'
      reason: Creating
      status: 'False'
      type: Ready
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "create failed: async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d  []}]"
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d  []}]"
      reason: AsyncCreateFailure
      status: 'False'
      type: LastAsyncOperation
spec:
  deletionPolicy: Delete
  forProvider:
    autoGeneratePassword: true
    autoMinorVersionUpgrade: true
    backupRetentionPeriod: 0
    caCertIdentifier: rds-ca-rsa4096-g1
    dbSubnetGroupName: prod1
    deleteAutomatedBackups: true
    engine: sqlserver-ee
    engineVersion: ''
    identifier: obfuscated-c75kw-mssql-db
    instanceClass: db.r5.xlarge
    multiAz: false
    optionGroupName: prod1-obfuscated
    region: eu-west-1
    skipFinalSnapshot: true
    tags:
      crossplane-kind: instance.rds.aws.upbound.io
      crossplane-name: obfuscated-c75kw-mssql-db
      crossplane-providerconfig: default
    vpcSecurityGroupIds:
      - sg-xxxxxxxxxxxxxxx
  initProvider:
    restoreToPointInTime:
      - sourceDbInstanceIdentifier: prod1-obfuscated-mssql-db
        useLatestRestorableTime: true
  managementPolicies:
    - '*'
  providerConfigRef:
    name: default
  writeConnectionSecretToRef:
    name: obfuscated-c75kw-mssql-db
    namespace: production

This worked correctly when I was using v1beta1 and the rds provider in version 0.47.0. K8s cluster running 1.26

Edit: I've managed to get the controller to work correctly by updating the manager resource with the crossplane.io/external-name annotation and setting it to the same value as metadata.name and spec.forProvider.identifier. @mjnovice something that might help you

Edit2: After the controller decided that there's actually no problem with the Instance resource it updated the crossplane.io/external-name annotation as well to something more like an ID I'd expect crossplane.io/external-name=db-MVTTMGI6KFQ66F5X2DWJH7WNWQ

mbbush commented 1 week ago

The terraform provider version 5.0 upgrade contained a breaking change to this resource, which we mitigated, but may not have been able to fully restore the terraform provider 4.x behavior.

The details of the change are both important and confusing. In terraform provider aws v4.x, the terraform id was the user-defined 'identifier' field. In v5.0 they changed it to instead use the aws-generated database resource id. I think this code comment from the terraform aws provider does a pretty good job of explaining it:

// NOTE ON "ID", "IDENTIFIER":
// ID is overloaded and potentially confusing. Hopefully this clears it up.
// * ID, as in d.Id(), d.SetId(), is:
//    - the same as AWS calls the "dbi-resource-id" a/k/a "database instance resource ID"
//    - unchangeable/immutable
//    - called either "id" or "resource_id" in schema/state (previously was only "resource_id")
// * "identifier" is:
//    - user-defined identifier which AWS calls "identifier"
//    - can be updated
//    - called "identifier" in the schema/state (previously was also "id")

For an upjet-based crossplane provider, the format of the terraform id is very important. Ideally, we can construct it from the metadata.name of the resource, or from its spec. For resources where we have configured the provider to do this, when a managed resource is created, the provider can then (more or less) run a terraform import to determine if there's an existing resource with this id. If there is, the provider assumes control of it, and makes the necessary update calls to make it match the desired spec. If not, it creates a new resource.

In terraform provider aws v4.x, which used the identifier field (which is user-defined) as the terraform id, it was easy to construct the terraform id from the spec of the Instance.rds managed resource, and we did so. But in terraform provider v5.0, they changed the provider to instead use the database resource id (which is generated by AWS) as the terraform id. This id is not in the spec of the managed resource, nor is it possible/reasonable to expect the user to always provide it. So instead we switched to use the IdentifierFromProvider external name configuration for this resource.

IdentifierFromProvider should still work just fine. We use it for a ton of resource kinds. It's close to if not the most common external name configuration in provider-upjet-aws. But it does mean that versions 0.x of this crossplane provider could reliably detect (by making an api call) whether a resource already existed before trying to create one, and versions 1.x of this provider cannot.

I have not yet tried to reproduce this issue, but my working theory is that the sequence of events goes like this:

The provider creates a new RDS instance with the user-supplied identifier my-db-identifier.
Somehow, that instance gets created (or into a state that results in the reported DBInstanceAlreadyExists error while somehow not actually being created), but the provider somehow loses track of it.
The provider, thinking there is no existing instance, tries to create it again.
AWS returns the DBInstanceAlreadyExists error.

If my theory is correct (which I hope some answers below can help confirm or refute), then the remaining task is to figure out what's happening in step 2, and make it not happen. If we're lucky, it could be as simple as a timeout configuration needing to be tweaked, but it may end up being much more complicated.

Some questions for those of you who have experienced this issue:

When it happens, does the database instance with the identifier set in your spec.forProvider actually exist in RDS? Is there any noticeable difference from the state in the managed resource? Does it have the crossplane-added tags?
How much time elapses between creating the Instance managed resource and getting the error?
Some have stated that this is an intermittent issue. Roughly how often does it happen? Almost but not quite all the time? Rarely? Somewhere in between?
If you can run the provider-aws-rds pod with the --debug argument (using either a ControllerConfig or DeploymentRuntimeConfig), and can provide debug logs reproducing the issue, that would be very helpful.

PatTheSilent commented 1 week ago

I am attaching debug logs from the RDS provider that mention this instance.

Explore-logs-2024-06-20 12_08_32.txt

To further add to that, from the events, it seems that after 30 minutes, problems start appearing. The point-in-time restore takes a fair bit longer for that database than 30 minutes. The instance itself is configured correctly. All changes specified in the manifest are there. Right now, it happens each time I try to recreate that database, which is around once per day.

PatTheSilent commented 1 week ago

Ok, one thing that actually is misconfigured compared with the manifest is the backup retention window. In the manifest, I set backupRetentionPeriod to 0, and the instance has 7. It couldn't set to 0, because apparently you can't disable backups when an instance is restored or something. I honestly don't know why that'd be the case but I changed the retention to 1 and it passed. Now the provider is happy about the resource. I wonder if the providers inability to apply that setting had some part in this. Will test that out once I get the chance.

mbbush commented 1 week ago

Thanks for the logs @PatTheSilent, they were really helpful.

What I see confirms my working theory above. It looks like the request to create the database instance is timing out after about 40 minutes, and then the provider is repeatedly trying to make more Create api calls, which all fail quickly due to the duplicate identifier, then are retried immediately, fail again, etc. The terraform provider configures a 40 minute timeout for Create requests for this resource. It sounds like in your case, that's not long enough.

This is also consistent with your finding about backupRetentionPeriod, as the terraform aws provider applies that parameter after the initial db creation call, when restoring a db from a snapshot (because the particular RDS api command for restoring a db from a snapshot doesn't support that parameter). If the Create timed out, that would never get called.

I'm not sure, but I don't think there's a user-facing way for you to adjust the timeout, and it can only be adjusted in the provider source code. That might be a good enhancement to add to upjet. It seems like this is a resource that could potentially take a very long time to create, especially if you're restoring from a large snapshot. Can you tell how long it's taking this RDS instance to actually be created?

There's an additional problem, which is that this situation should result in the provider adding the external create incomplete annotation, raising an error, and stopping further reconciliation of this resource, because it's lost track of the resource it created. That's not happening, and I want to figure out why.

PatTheSilent commented 1 week ago

@mbbush from what I could see, it took over an hour for it to be available. It's a MS SQL database with a few TB of data. And yeah, it would be great if we could configure timeouts like in Terraform, maybe via an annotation. Let me know if I can help you further in any way.

crossplane-contrib / provider-upjet-aws