aws-solutions / network-orchestration-for-aws-transit-gateway

The Network Orchestration for AWS Transit Gateway solution automates the process of setting up and managing transit networks in distributed AWS environments. It creates a web interface to help control, audit, and approve (transit) network changes.
https://aws.amazon.com/solutions/implementations/serverless-transit-network-orchestrator/
Apache License 2.0

DuplicateTransitGatewayAttachment error received on 1 of 3 subnets #1

Closed by roger-reed 2 years ago

roger-reed commented 4 years ago

I create three subnets in exactly the same way through CloudFormation, and consistently 1 of the 3 is not attached even though the Attach-to-tgw tag is present. The subnets are in 3 different AZs. The error tag shows the following:

STNOStatus-Subnet-Error | 2020-01-31T03:22:19Z: An error occurred (DuplicateTransitGatewayAttachment) when calling the CreateTransitGatewayVpcAttachment operation: tgw-04e93f5458d6a0662 has non-deleted Transit Gateway Attachments with same VPC ID.

I am able to correct the issue manually and attach to the subnet without issue.

I'm attaching a CloudFormation template that reproduces the error: k8s-stno-error-test-vpc.template.json.txt

Thanks for taking a look!

groverlalit commented 4 years ago

Hello Roger, Thanks for reporting this issue. We have added this issue to our backlog.

naslanidis commented 4 years ago

Hi,

I am seeing the exact same behaviour:

"An error occurred (DuplicateTransitGatewayAttachment) when calling the CreateTransitGatewayVpcAttachment operation: tgw-003ded20995f40ef3 has non-deleted Transit Gateway Attachments with same VPC ID.",

I have 3 subnets and only 2 of them will attach. I thought this might be related to AZ IDs being different across accounts, but I'm not so sure. It seems that it's trying to create a new attachment instead of adding the additional subnets to an existing attachment.
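To make the "create versus add to existing" distinction concrete, here is a minimal Python sketch of a describe-then-modify flow. It is not taken from the STNO code base: the helper name `attach_subnet` is hypothetical, and `ec2` is assumed to behave like a boto3 EC2 client.

```python
def attach_subnet(ec2, tgw_id, vpc_id, subnet_id):
    """Create the VPC attachment if none exists, otherwise add the subnet to it.

    `ec2` is assumed to behave like a boto3 EC2 client. This helper is a
    hypothetical illustration, not code from STNO.
    """
    response = ec2.describe_transit_gateway_vpc_attachments(
        Filters=[
            {"Name": "transit-gateway-id", "Values": [tgw_id]},
            {"Name": "vpc-id", "Values": [vpc_id]},
            # Ignore attachments that are deleted or being deleted.
            {"Name": "state", "Values": ["pending", "available", "modifying"]},
        ]
    )
    attachments = response["TransitGatewayVpcAttachments"]

    if not attachments:
        # No live attachment for this VPC yet: create one with this subnet.
        created = ec2.create_transit_gateway_vpc_attachment(
            TransitGatewayId=tgw_id, VpcId=vpc_id, SubnetIds=[subnet_id]
        )
        return created["TransitGatewayVpcAttachment"]["TransitGatewayAttachmentId"]

    # An attachment already exists: add the subnet instead of creating again.
    existing = attachments[0]
    attachment_id = existing["TransitGatewayAttachmentId"]
    if subnet_id not in existing["SubnetIds"]:
        ec2.modify_transit_gateway_vpc_attachment(
            TransitGatewayAttachmentId=attachment_id, AddSubnetIds=[subnet_id]
        )
    return attachment_id
```

A check like this does not remove the race on its own: two executions that start at the same moment can both see no attachment and both try to create one, and the modify call can be rejected while the attachment is still pending.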

Is this project still active and being worked on?

Thanks

EDIT: I did some further testing. If you add the tags to one subnet at a time and let each state machine execution finish, it works every time. That is, if I tag subnet 1, wait, tag subnet 2, wait, and so on, all subnets are attached correctly. I tested this quite a few times. However, when I used CloudFormation to add the tags to 3 subnets at the same time, the result was unpredictable: sometimes individual state machine executions work, other times they don't.

Looking at the state machine logs and Lambda logs, the issues arise when the 3 state machine executions run simultaneously. Those executions get in the way of each other, even with the various resource state checks implemented in the code. Some example errors:

Subnet 1: 7:19:02:523 pm: "errorMessage": "An error occurred (IncorrectState) when calling the AssociateTransitGatewayRouteTable operation: tgw-attach-00fd6c35b75ff25bb is in invalid state"

Subnet 2: 7:19:03.538 pm: "errorMessage": "An error occurred (IncorrectState) when calling the EnableTransitGatewayRouteTablePropagation operation: tgw-attach-00fd6c35b75ff25bb is in invalid state",

Subnet 3: 7:19:08:293 pm "errorMessage": "An error occurred (Resource.AlreadyAssociated) when calling the AssociateTransitGatewayRouteTable operation: Transit Gateway Attachment tgw-attach-00fd6c35b75ff25bb is already associated to a route table.",

I badly need some automation for Transit Gateway to connect a significant number of workload accounts, so I will have a look myself to see if I can find a workaround, at least in the short term.

rakshb commented 4 years ago

Hello @naslanidis. Thanks for the note. This issue will be fixed in the next release of STNO, planned for Q3 2020.

naslanidis commented 4 years ago

Hi, thanks that's great news.

I've actually worked around it by simply adding a catch to some of the states in the state machine that were sometimes failing and routing the flow back to the top to retry. Not perfect, but it works, and I look forward to seeing the next version.

adamcousins commented 4 years ago

@naslanidis can you share your workaround? I'm hitting this same error and have added 15 retry attempts on the step "TGW Attachment CRUD Operations" and on some other failing steps; however, one remaining subnet never gets the attachment, even after 15 retries.

@rakshb Any way to get access to the fix for this in a beta version or similar?

ghost commented 4 years ago

I have also faced this issue. I added dependencies (DependsOn) so that the subnets are created one at a time and STNO is not confused. First I create subnet-A with the tag (Attach-to-tgw); then subnet-B DependsOn subnet-A, subnet-C DependsOn subnet-B, and so on.
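As an illustration of that chaining, here is a small Python sketch that emits CloudFormation subnet resources in which each subnet DependsOn the previous one. The logical IDs, CIDR blocks, availability zones, the `Vpc` reference, and the empty tag value are all placeholders, not values from an actual template.

```python
import json


def chain_subnets(subnets):
    """Return CloudFormation subnet resources where each one DependsOn the previous.

    `subnets` is a list of (logical_id, cidr_block, availability_zone) tuples.
    The VPC logical ID "Vpc" and the empty tag value are placeholders.
    """
    resources = {}
    previous_id = None
    for logical_id, cidr_block, availability_zone in subnets:
        resource = {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "Vpc"},
                "CidrBlock": cidr_block,
                "AvailabilityZone": availability_zone,
                "Tags": [{"Key": "Attach-to-tgw", "Value": ""}],
            },
        }
        if previous_id is not None:
            # Force CloudFormation to create this subnet after the previous one.
            resource["DependsOn"] = previous_id
        resources[logical_id] = resource
        previous_id = logical_id
    return resources


if __name__ == "__main__":
    example = chain_subnets(
        [
            ("SubnetA", "10.0.0.0/24", "us-east-1a"),
            ("SubnetB", "10.0.1.0/24", "us-east-1b"),
            ("SubnetC", "10.0.2.0/24", "us-east-1c"),
        ]
    )
    print(json.dumps({"Resources": example}, indent=2))
```

DependsOn only orders resource creation. CloudFormation does not wait for the STNO state machine to finish before it creates the next subnet, so executions may still overlap.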

naslanidis commented 4 years ago

Hi @bebych and @adamcousins

Initially I played around with the STNO Lambda to work around this, but then I decided I'd rather not make changes there if a new version is coming soon. So instead I just added some retries and catches to the state machine spec. You can replace the original state machine JSON spec in the STNO hub CFN template with the attached one and it should work. I haven't changed anything else. If there's an error or an 'IncorrectState' response, it simply restarts from the describe states at the start. It's not pretty, but given that a new version should be coming, I wanted to keep changes to a minimum.

state_machine_spec.txt

I've attached a diff output compared to the original spec so you can see what's changed.

state_machine_spec_diff.txt
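For readers without the attachments, the general shape of such a change can be sketched in Python. This is not the content of the attached spec: the helper `add_retry_and_catch`, the state names, the Lambda ARN, and the retry timings are all hypothetical. Only the `Retry` and `Catch` fields themselves are standard Amazon States Language.

```python
import json


def add_retry_and_catch(state, restart_state, max_attempts=3):
    """Return a copy of a Task state with Retry and Catch fields added.

    `restart_state` names the state to jump back to once retries are used up.
    Timings and error matching are illustrative assumptions.
    """
    patched = dict(state)
    patched["Retry"] = [
        {
            "ErrorEquals": ["States.ALL"],  # retry on any error
            "IntervalSeconds": 10,
            "MaxAttempts": max_attempts,
            "BackoffRate": 2.0,
        }
    ]
    patched["Catch"] = [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # keep the original input, attach the error
            "Next": restart_state,
        }
    ]
    return patched


if __name__ == "__main__":
    # Placeholder Task state; the ARN and state names are not from STNO.
    task = {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:PLACEHOLDER",
        "Next": "PlaceholderNextState",
    }
    print(json.dumps(add_retry_and_catch(task, "PlaceholderDescribeState"), indent=2))
```

Routing a Catch back to an earlier state can loop indefinitely if the error never clears, so a counter or an execution timeout is worth considering alongside it.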

dougireton commented 4 years ago

I've hit this same issue.

sreejanigit commented 3 years ago

If we try to attach a transit gateway to two subnets in the same availability zone, this error occurs. According to the rule, a transit gateway can attach to only one subnet per availability zone. Please correct me if I am wrong.

groverlalit commented 3 years ago

@sreejanigit This issue is related to a duplicate TGW attachment (DuplicateTransitGatewayAttachment) caused by a race condition when more than 2 subnets are tagged at exactly the same time (for example, by a CFN template). If we try to attach a transit gateway to two subnets in the same availability zone, we should expect a DuplicateSubnetsInSameZoneError exception instead.

sreejanigit commented 3 years ago

@groverlalit, sorry for my misunderstanding. Thanks much for correcting me.

badaldavda8 commented 3 years ago

Is this resolved yet? I am facing this constantly while creating a VPC using CFN. I tried adding DependsOn, but it still fails most of the time with "errorMessage": "An error occurred (Resource.AlreadyAssociated) when calling the AssociateTransitGatewayRouteTable operation: Transit Gateway Attachment..."

Or InvalidState error message.

adamcousins commented 2 years ago

@groverlalit is this project abandoned or will bugs like this be resolved in a timely fashion?

rayjanoka commented 2 years ago

same issue - what's up @groverlalit? Any word on that new release?

rayjanoka commented 2 years ago

the @naslanidis fix seemed to work for me, thx!

markymarkus commented 2 years ago

Yes, the issue still exists on a fresh install of v2.0.0. Trying to create a VPC with 3 subnets from CloudFormation:

Step Functions: First subnet succeeded.

Second subnet: "errorMessage": "An error occurred (IncorrectState) when calling the AssociateTransitGatewayRouteTable operation: tgw-attach-0a123456789abcdefg is in invalid state",

Third subnet: "errorMessage": "An error occurred (Resource.AlreadyAssociated) when calling the AssociateTransitGatewayRouteTable operation: Transit Gateway Attachment tgw-attach-0a123456789abcdefg is already associated to a route table.",

gsingh04 commented 2 years ago

We are currently working on the fix. This issue will be addressed in the next release. Please continue to monitor this thread for updates.

rakshb commented 2 years ago

We have fixed this issue in v3.0.0, which was released this week.