Open mjnovice opened 3 weeks ago
@mjnovice, could you please try changing the field identifier: ci-aseks6062802
?
@turkenf that helps, but that also creates another RDS no ? Is there a known cause for this issue ?
I could not reproduce the bug, when you get this error, are you sure that there is no other instance with the identifier ci-aseks6062802
?
@turkenf nope. There was none before applying the CR. This happens intermittently for me.
I've got a similar problem. I'm trying to create an RDS instance from another instance using restoreToPointInTime
apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
annotations:
crossplane.io/external-create-failed: '2024-06-19T10:42:51Z'
crossplane.io/external-create-pending: '2024-06-19T10:42:51Z'
crossplane.io/external-create-succeeded: '2024-06-19T09:18:22Z'
creationTimestamp: '2024-06-19T09:18:21Z'
finalizers:
- finalizer.managedresource.crossplane.io
generation: 2
name: obfuscated-c75kw-mssql-db
resourceVersion: '1261333276'
uid: e04aadb8-ed95-4ec9-b09e-940f6bc4ade2
selfLink: >-
/apis/rds.aws.upbound.io/v1beta2/instances/obfuscated-c75kw-mssql-db
status:
atProvider: {}
conditions:
- lastTransitionTime: '2024-06-19T09:18:22Z'
reason: Creating
status: 'False'
type: Ready
- lastTransitionTime: '2024-06-19T10:42:51Z'
message: "create failed: async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d []}]"
reason: ReconcileError
status: 'False'
type: Synced
- lastTransitionTime: '2024-06-19T10:42:51Z'
message: "async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d []}]"
reason: AsyncCreateFailure
status: 'False'
type: LastAsyncOperation
spec:
deletionPolicy: Delete
forProvider:
autoGeneratePassword: true
autoMinorVersionUpgrade: true
backupRetentionPeriod: 0
caCertIdentifier: rds-ca-rsa4096-g1
dbSubnetGroupName: prod1
deleteAutomatedBackups: true
engine: sqlserver-ee
engineVersion: ''
identifier: obfuscated-c75kw-mssql-db
instanceClass: db.r5.xlarge
multiAz: false
optionGroupName: prod1-obfuscated
region: eu-west-1
skipFinalSnapshot: true
tags:
crossplane-kind: instance.rds.aws.upbound.io
crossplane-name: obfuscated-c75kw-mssql-db
crossplane-providerconfig: default
vpcSecurityGroupIds:
- sg-xxxxxxxxxxxxxxx
initProvider:
restoreToPointInTime:
- sourceDbInstanceIdentifier: prod1-obfuscated-mssql-db
useLatestRestorableTime: true
managementPolicies:
- '*'
providerConfigRef:
name: default
writeConnectionSecretToRef:
name: obfuscated-c75kw-mssql-db
namespace: production
This worked correctly when I was using v1beta1 and the rds provider in version 0.47.0
. K8s cluster running 1.26
Edit: I've managed to get the controller to work correctly by updating the manager resource with the crossplane.io/external-name
annotation and setting it to the same value as metadata.name
and spec.forProvider.identifier
. @mjnovice something that might help you
Edit2: After the controller decided that there's actually no problem with the Instance resource it updated the crossplane.io/external-name annotation
as well to something more like an ID I'd expect crossplane.io/external-name=db-MVTTMGI6KFQ66F5X2DWJH7WNWQ
The terraform provider version 5.0 upgrade contained a breaking change to this resource, which we mitigated, but may not have been able to fully restore the terraform provider 4.x behavior.
The details of the change are both important and confusing. In terraform provider aws v4.x, the terraform id
was the user-defined 'identifier' field. In v5.0 they changed it to instead use the aws-generated database resource id. I think this code comment from the terraform aws provider does a pretty good job of explaining it:
// NOTE ON "ID", "IDENTIFIER":
// ID is overloaded and potentially confusing. Hopefully this clears it up.
// * ID, as in d.Id(), d.SetId(), is:
// - the same as AWS calls the "dbi-resource-id" a/k/a "database instance resource ID"
// - unchangeable/immutable
// - called either "id" or "resource_id" in schema/state (previously was only "resource_id")
// * "identifier" is:
// - user-defined identifier which AWS calls "identifier"
// - can be updated
// - called "identifier" in the schema/state (previously was also "id")
For an upjet-based crossplane provider, the format of the terraform id is very important. Ideally, we can construct it from the metadata.name
of the resource, or from its spec
. For resources where we have configured the provider to do this, when a managed resource is created, the provider can then (more or less) run a terraform import
to determine if there's an existing resource with this id. If there is, the provider assumes control of it, and makes the necessary update calls to make it match the desired spec
. If not, it creates a new resource.
In terraform provider aws v4.x, which used the identifier
field (which is user-defined) as the terraform id, it was easy to construct the terraform id from the spec
of the Instance.rds
managed resource, and we did so. But in terraform provider v5.0, they changed the provider to instead use the database resource id (which is generated by AWS) as the terraform id. This id is not in the spec
of the managed resource, nor is it possible/reasonable to expect the user to always provide it. So instead we switched to use the IdentifierFromProvider
external name configuration for this resource.
IdentifierFromProvider
should still work just fine. We use it for a ton of resource kinds. It's close to if not the most common external name configuration in provider-upjet-aws. But it does mean that versions 0.x of this crossplane provider could reliably detect (by making an api call) whether a resource already existed before trying to create one, and versions 1.x of this provider cannot.
I have not yet tried to reproduce this issue, but my working theory is that the sequence of events goes like this:
my-db-identifier
. DBInstanceAlreadyExists
error while somehow not actually being created), but the provider somehow loses track of it. DBInstanceAlreadyExists
error.If my theory is correct (which I hope some answers below can help confirm or refute), then the remaining task is to figure out what's happening in step 2, and make it not happen. If we're lucky, it could be as simple as a timeout configuration needing to be tweaked, but it may end up being much more complicated.
Some questions for those of you who have experienced this issue:
identifier
set in your spec.forProvider
actually exist in RDS? Is there any noticeable difference from the state in the managed resource? Does it have the crossplane-added tags?Instance
managed resource and getting the error? --debug
argument (using either a ControllerConfig or DeploymentRuntimeConfig), and can provide debug logs reproducing the issue, that would be very helpful.I am attaching debug logs from the RDS provider that mention this instance.
Explore-logs-2024-06-20 12_08_32.txt
To further add to that, from the events, it seems that after 30 minutes, problems start appearing. The point-in-time restore takes a fair bit longer for that database than 30 minutes. The instance itself is configured correctly. All changes specified in the manifest are there. Right now, it happens each time I try to recreate that database, which is around once per day.
Ok, one thing that actually is misconfigured compared with the manifest is the backup retention window. In the manifest, I set backupRetentionPeriod
to 0, and the instance has 7. It couldn't set to 0, because apparently you can't disable backups when an instance is restored or something. I honestly don't know why that'd be the case but I changed the retention to 1 and it passed. Now the provider is happy about the resource. I wonder if the providers inability to apply that setting had some part in this. Will test that out once I get the chance.
Thanks for the logs @PatTheSilent, they were really helpful.
What I see confirms my working theory above. It looks like the request to create the database instance is timing out after about 40 minutes, and then the provider is repeatedly trying to make more Create
api calls, which all fail quickly due to the duplicate identifier, then are retried immediately, fail again, etc. The terraform provider configures a 40 minute timeout for Create
requests for this resource. It sounds like in your case, that's not long enough.
This is also consistent with your finding about backupRetentionPeriod
, as the terraform aws provider applies that parameter after the initial db creation call, when restoring a db from a snapshot (because the particular RDS api command for restoring a db from a snapshot doesn't support that parameter). If the Create timed out, that would never get called.
I'm not sure, but I don't think there's a user-facing way for you to adjust the timeout, and it can only be adjusted in the provider source code. That might be a good enhancement to add to upjet. It seems like this is a resource that could potentially take a very long time to create, especially if you're restoring from a large snapshot. Can you tell how long it's taking this RDS instance to actually be created?
There's an additional problem, which is that this situation should result in the provider adding the external create incomplete annotation, raising an error, and stopping further reconciliation of this resource, because it's lost track of the resource it created. That's not happening, and I want to figure out why.
@mbbush from what I could see, it took over an hour for it to be available. It's a MS SQL database with a few TB of data. And yeah, it would be great if we could configure timeouts like in Terraform, maybe via an annotation. Let me know if I can help you further in any way.
Is there an existing issue for this?
Affected Resource(s)
No response
Resource MRs required to reproduce the bug
Steps to Reproduce
What happened?
Relevant Error Output Snippet