aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.57k stars 3.88k forks source link

rds: refreshing VPC context generates inconsistent ordering that causes unnecessary DBProxy resource replacement #25570

Open ahammond opened 1 year ago

ahammond commented 1 year ago

Describe the bug

We do a subnet selection to get our subnets. The behaviour is non-deterministic. Cfn is too stupid to treat the list as a set, so... we get stuff like this. As you can see below it wants to replace our proxy but can't because it's a named resource. The reason it wants to replace the resource is that the order of the subnets in the list changed.

[~] AWS::RDS::DBProxy GlobalStagingUsWest2/DochistoryAuroraStg/Or1DochistoryStg/Proxy Or1DochistoryStgProxy87EB58B0 replace
 └─ [~] VpcSubnetIds (requires replacement)
     └─ @@ -4,9 +4,9 @@
        [ ]   "subnet-07260952ef7787a6d",
        [ ]   "subnet-0b868f96519464bdb",
        [ ]   "subnet-0d9a398b9a31f6fcc",
+       [+]   "subnet-09a4826d4eb84d03c",
        [ ]   "subnet-0c8fd644d2a2f91c0",
        [ ]   "subnet-0656782790830be94",
-       [-]   "subnet-08a104b7791c32f08",
-       [-]   "subnet-09a4826d4eb84d03c",
-       [-]   "subnet-029c625dae0eb4783"
+       [+]   "subnet-029c625dae0eb4783",
+       [+]   "subnet-08a104b7791c32f08"
        [ ] ]

As a result, the stack update is broken and we can't roll forward actual changes.

Expected Behavior

Cfn shouldn't differentiate between sets and lists and not treat sets like the order matters.

CDK should compensate for Cfn's incompetence by always providing sets as ordered lists. Unfortunately this is probably a breaking change. But you could put it behind a feature flag, I guess.

Current Behavior

No idea what caused the order of the subnets to change.

Reproduction Steps

No idea what caused the order of the subnets to change.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.78.0

Framework Version

No response

Node.js Version

16.20

OS

Mac

Language

Typescript

Language Version

4.9.5

Other information

No response

peterwoodworth commented 1 year ago

Thanks for reporting this, I'm not sure I've seen this occur before, and in fact I was talking with a separate customer that based on my understanding, the order of subnets should persist (that was for a different scenario however). I would consider this a bug on our end, because in the case someone wants to truncate the list of subnets, ordering would matter

We'll try to investigate this, are there any changes you made that could potentially explain this, or help with our investigation?

ahammond commented 1 year ago

Hmm. My best guess is that something happened to change the cdk.context.json file? I assume this is where you are persisting order? If so, that might be my workaround: tweak the contents of the context file. But... ???

ahammond commented 1 year ago

Yup, huge changes in the cdk.context.json file.

peterwoodworth commented 1 year ago

Yes, this would absolutely explain it. With the way context works, I'm not really sure how we'd address this. Clearing the context is what allows the context calls to be refreshed again, and at that point we would have no idea what the previous ordering was (additionally, we probably shouldn't care about the previous state anyway for most cases). The only workaround I have in mind off the top of my head is to manually edit the context file yourself once the context is fetched (i.e. cdk synth to generate context, manually edit file, then cdk deploy), but that's obviously not a good experience.

In the context provider we sort the azs, maybe we should do this with subnets as well? I'm not sure if the describeSubnets API call guarantees the same ordering or not. If not, then I would think we need to sort this response as well so that the ordering can remain consistent

https://github.com/aws/aws-cdk/blob/5ccc56975c323ea19fd0917def51184e13f440d9/packages/aws-cdk/lib/context-providers/vpcs.ts#L70

ahammond commented 1 year ago

Yeah, I think you'd want to sort the subnets the same way, but... probably better put it behind a FF or provide some kind of migration path. Would be pretty great if Cfn didn't suck in the first place. :)

peterwoodworth commented 1 year ago

i've reported this internally to cfn, let's see what they say (P88836143)

ahammond commented 1 year ago

@peterwoodworth any news?

peterwoodworth commented 1 year ago

I still can't share any updates unfortunately @ahammond, your TAM might be able to help here, I have shared the ticket with them

evgenyka commented 1 year ago

CDK do ordering elsewhere, so we probably need to do it here as well on CDK side.

peterwoodworth commented 1 year ago

I believe the service team is going to address this soon @evgenyka, we escalated and they identified the issue on their end.

ahammond commented 1 year ago

We implemented https://github.com/time-loop/clickup-projen/tree/main/docs/cdk-context-json which largely eliminates the issue we had with devs just copy-pasting their cdk.context.json files around. But... this is still a foot-cannon. Cfn really should sets for things like this where order should be irrelevant.