aws-solutions / network-orchestration-for-aws-transit-gateway

The Network Orchestration for AWS Transit Gateway solution automates the process of setting up and managing transit networks in distributed AWS environments. It creates a web interface to help control, audit, and approve (transit) network changes.
https://aws.amazon.com/solutions/implementations/serverless-transit-network-orchestrator/
Apache License 2.0

STNO Configuration Issue Causing Downtime During VPC RT CIDR Updates - Version 3.2.0 #124

Closed bamishr closed 1 month ago

bamishr commented 1 month ago

Describe the bug

We have deployed version 3.2.0 of STNO in our existing environment. Whenever I launch a new VPC, it handles the TGW attachment, association, and propagation, and adds the CIDR blocks from ListOfCustomCidrBlocks to the route table.

Now I need to add a new CIDR to ListOfCustomCidrBlocks, which I will pass through the parameters and deploy. The problem is that, to update the VPC route tables of my existing deployed VPCs, I first need to remove the subnet tag (AttachmentTag, Attach-to-tgw) to trigger the event. However, as soon as I remove the tag, it deletes all the entries from the VPC route tables.

When I add the tag back to the subnet, it re-adds the existing and new entries (which is fine).

The issue here is that, for the prod accounts, there will be 4-5 minutes of downtime between removing the tag (which deletes all entries) and re-adding it. How can we handle this situation?

To Reproduce

1. Deploy the STNO hub template in the hub account and the spoke template in the spoke account.
2. Add 5 CIDR ranges to the ListOfCustomCidrBlocks parameter and deploy a VPC in the spoke account with the tags, so the VPC route table entries are added automatically.
3. Add a new CIDR range to the ListOfCustomCidrBlocks parameter (now 6 CIDRs total), then delete the subnet tag to invoke the event so that the VPC route table gets updated. This deletes all the existing entries. (Here we have downtime.)
4. When you add the tag back, all entries are restored.

So the problem is downtime at step 3.

Expected behavior

It should not delete the existing entries from the VPC route table. Ideally, some other tag could be used to add only the new entry to the VPC route table.


gsingh04 commented 1 month ago

Thank you @bamishr for opening the issue. Please allow us to investigate it. We will update the thread if further information is needed.

gsingh04 commented 1 month ago

@bamishr could you update to the latest solution version, 3.3.7, and confirm whether you continue to see this behavior?

bamishr commented 1 month ago

@gsingh04 - Thanks for replying, but it's not possible for me to update it directly right now because all our accounts are connected to STNO, which is live. It would be risky to do so without confirmation. Were you able to replicate the issue?

gsingh04 commented 1 month ago

I understand the issue: we need to diff the route table entries and apply only the changes, rather than deleting and re-adding all entries. We are identifying a fix or a reasonable workaround for this behavior.
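The diff-based approach described above could be sketched roughly like this. This is a minimal illustration, not the solution's actual code; the function name and the surrounding assumption that routes are keyed by destination CIDR are hypothetical:

```python
def diff_routes(current_cidrs, desired_cidrs):
    """Compute which routes to add and remove so a route table converges
    on desired_cidrs without touching entries that are unchanged."""
    current = set(current_cidrs)
    desired = set(desired_cidrs)
    to_add = sorted(desired - current)     # new CIDRs: create these routes
    to_remove = sorted(current - desired)  # stale CIDRs: delete these routes
    # CIDRs present in both sets are left alone, so existing traffic
    # through the TGW keeps flowing while the update is applied.
    return to_add, to_remove

# Example: one new CIDR is appended to ListOfCustomCidrBlocks.
existing = ["10.0.0.0/16", "10.1.0.0/16"]
updated = ["10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/16"]
add, remove = diff_routes(existing, updated)
# add == ["10.2.0.0/16"], remove == []
```

With a diff like this, only the new CIDR would get a route created, and nothing would be deleted, avoiding the downtime window described in the report.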

bamishr commented 1 month ago

@gsingh04 and @groverlalit - Thanks for picking this up. Could you please tell us when this will be fixed? An approximate date would be helpful, because we need to plan accordingly and this is very critical for us.

gockle commented 1 month ago

Hi @bamishr,

The solution was intentionally designed to isolate network changes (route table changes) from CloudFormation stack changes. In this case, changes to the CloudFormation parameter ListOfCustomCidrBlocks will not propagate to the VPC route tables until the tags are removed and added again; doing so removes the routes and re-adds them based on the new values of the CloudFormation parameter. To work around this behavior, you can switch to using a managed prefix list. The CloudFormation template provides a parameter, CustomerManagedPrefixListIds, for supplying managed prefix list IDs; the CIDR list can then be configured and managed in the prefix list.

Please note

  1. Changes to the prefix list will immediately apply to all the route tables.
  2. Managed Prefix list needs to be shared in RAM for the organization.
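For illustration, a managed prefix list shared with the organization through RAM might look like the following CloudFormation fragment. All names, CIDRs, and the organization ARN are placeholders, and this is a sketch rather than part of the solution's templates:

```yaml
Resources:
  TransitCidrPrefixList:
    Type: AWS::EC2::PrefixList
    Properties:
      PrefixListName: stno-custom-cidrs
      AddressFamily: IPv4
      MaxEntries: 10          # leave headroom for future CIDR additions
      Entries:
        - Cidr: 10.0.0.0/16
          Description: example custom CIDR block

  PrefixListShare:
    Type: AWS::RAM::ResourceShare
    Properties:
      Name: stno-prefix-list-share
      ResourceArns:
        - !GetAtt TransitCidrPrefixList.Arn
      Principals:
        - arn:aws:organizations::111122223333:organization/o-exampleorgid
```

The resulting prefix list ID (pl-xxxxxxxx) is what would be supplied to the CustomerManagedPrefixListIds parameter.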

To remediate the downtime issue, you can follow these steps:

  1. Create managed prefix list and share it through RAM
  2. Update the CloudFormation stack to use the managed prefix list. This future-proofs new VPCs/subnets onboarded to the solution, which will get the needed route table updates via the prefix list.
  3. Manually add the needed route, using the prefix list created in the previous step, to your existing subnet route tables.
  4. Remove the old destinations with the previous CIDR blocks that use the TGW as target in the same existing subnet route tables. (This step is required because, if the managed prefix list is updated, the change may not take effect while the static routes are still present in the route table.) Please review the documentation for route table priority.

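Steps 3 and 4 above could be performed with AWS CLI calls along these lines. This is illustrative only: all resource IDs are placeholders that must be replaced with your own, and the commands must be repeated per route table (and per old CIDR block in step 4):

```shell
# Step 3: add a route whose destination is the managed prefix list,
# targeting the existing transit gateway (IDs are placeholders).
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-prefix-list-id pl-0123456789abcdef0 \
    --transit-gateway-id tgw-0123456789abcdef0

# Step 4: remove an old static CIDR route so the prefix-list route
# takes effect (repeat for each previous CIDR block).
aws ec2 delete-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.0.0.0/16
```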
gockle commented 1 month ago

Closing this, as the behavior observed for ListOfCustomCidrBlocks updates is consistent with the solution design.