Open Cupidazul opened 1 year ago
Good morning, Thanks for providing us the feedback and sharing details of this event.
It seems the user tagged a VPC that already had a route to a different transit gateway (from the one defined in the STNO solution stack).
By design, the solution creates route(s) in the VPC route table using the customer-defined (CFN Params) destination and target. If the destination (for example, 0/0, RFC-1918, Custom destinations) already exists in the route table with another target, STNO neither adds the route nor updates the existing one, to avoid impact to the customer network.
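To make the intended rule concrete, here is a minimal sketch of that decision. It is not STNO's code: `plan_route_action` is a hypothetical helper, and the route dictionaries only assume the `DestinationCidrBlock` and `TransitGatewayId` keys that EC2 returns for route table entries. It compares destinations by exact string match and does not detect overlapping CIDRs.

```python
from typing import Dict, List


def plan_route_action(routes: List[Dict[str, str]], destination: str, tgw_id: str) -> str:
    """Decide what to do for one destination in one route table.

    Hypothetical helper, not part of STNO. Returns "add",
    "skip_same_target", or "skip_conflict".
    """
    for route in routes:
        if route.get("DestinationCidrBlock") != destination:
            continue  # unrelated route, or a destination that is not an IPv4 CIDR
        if route.get("TransitGatewayId") == tgw_id:
            return "skip_same_target"  # already points at the managed TGW
        return "skip_conflict"  # same destination, different target: leave it alone
    return "add"  # destination is not present in this route table


existing = [{"DestinationCidrBlock": "10.0.0.0/8", "TransitGatewayId": "tgw-00zzzzzzzzzzzzzzz"}]
print(plan_route_action(existing, "10.0.0.0/8", "tgw-0xxxxxxxxxxxxxxxx"))  # skip_conflict
```

With a rule like this, a route table that already sends 10.0.0.0/8 to another TGW would be reported as a conflict rather than changed.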
The supported use cases do not assess all the existing routes or evaluate all the possible scenarios. It would be great if you could open a feature request to help us learn about new use cases.
As per the ticket, if we expect users to tag VPCs with existing routes to different TGWs, then the Configure-Manually option is the best choice. IMO, if the majority of the VPCs that need to be attached to the TGW are new, I would recommend not tagging the VPCs that have existing routes to unmanaged TGWs. A second option could be the use of "Conditional" tags for TGW route tables. For example, if the VPCs that have existing routes to a different TGW are in a different OU in the AWS Organization, you can use the compliance feature to auto-reject attachments and avoid impact to the network. (2, 3)
For 5, deletion of the routes added by the STNO workflow once you remove the tag is by design. The _delete_route function should add [this message](https://github.com/aws-solutions/network-orchestration-for-aws-transit-gateway/blob/main/source/lambda/tgw_vpc_attachment/lib/handlers/vpc_handler.py#L270-L274) in the logs. Please feel free to share how we can improve this message.
For 6 and 7, we have added backlog items to add specific log messages in the log group.
For 8, we are thinking about how we can help STNO users detect issues sooner. Any input/guidance based on your experience with STNO would be great.
If you prefer to share more details via a support ticket/TAM/SA, can you please reference the GitHub issue ID to help us connect the two tickets.
Good morning, Sir,
And thank you for your comprehensive reply.
"By design, the solution creates route(s) in the VPC route table using the customer defined (CFN Params) destination and target. If the destination (example, 0/0, RFC-1918, Custom destinations) already exist in the route table with another target, STNO does not add the route or update the route with new destination to avoid impact to the customer network."
This is exactly what we saw. Our STNO setup probably deleted the 10.0.0.0/8 route at some point, as the logs show (without impact to the customer), and when the STNO process succeeded on 6/11 we saw the 10.0.0.0/8 route being added. Before 6/11 we had issues with STNO tagging: we were tagging and approving only the VPCs while the subnets were not yet tagged.
Our setup is: "DefaultRoute" = "Custom-Destinations" CIDR_BLOCKS = "10.0.0.0/8"
I was able to filter the logs to give us a better view of this:
02/11/2023 16:18:51 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:18:54 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:19:02 | RouteTableId': 'rtb-00ccccccccccccccc' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:24:59 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:24:59 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00bbbbbbbbbbbbbbb
02/11/2023 16:25:03 | RouteTableId': 'rtb-00ccccccccccccccc' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:25:03 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00ccccccccccccccc
02/11/2023 16:25:04 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:25:04 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00aaaaaaaaaaaaaaa
02/11/2023 16:32:46 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:32:46 | Adding route: 10.0.0.0/8
02/11/2023 16:32:49 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:32:49 | Adding route: 10.0.0.0/8
02/11/2023 16:56:31 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:56:31 | Adding route: 10.0.0.0/8
02/11/2023 16:58:41 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:41 | Adding route: 10.0.0.0/8
02/11/2023 16:58:48 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:48 | Adding route: 10.0.0.0/8
02/11/2023 16:58:52 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:52 | Adding route: 10.0.0.0/8
02/11/2023 17:33:26 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:26 | Adding route: 10.0.0.0/8
02/11/2023 17:33:28 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:28 | Adding route: 10.0.0.0/8
02/11/2023 17:33:31 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:31 | Adding route: 10.0.0.0/8
06/11/2023 12:03:52 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:03:52 | Adding route: 10.0.0.0/8
06/11/2023 12:03:52 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00aaaaaaaaaaaaaaa
06/11/2023 12:06:02 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:06:02 | Adding route: 10.0.0.0/8
06/11/2023 12:06:02 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00ccccccccccccccc
06/11/2023 12:07:38 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:07:38 | Adding route: 10.0.0.0/8
06/11/2023 12:07:38 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00bbbbbbbbbbbbbbb
Note: 'NO 10.0.0.0/8 ROUTE !!!' means that full VPC information exists in the log with the full routing table but without 10.0.0.0/8 route.
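For anyone who wants to reproduce this kind of view, below is a rough sketch of how the filtered lines could be grouped per route table. It assumes the exact line layout shown above (our own filtered format, not raw STNO output) and day-first timestamps; the `timeline` function and the event labels are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

RTB_ID = re.compile(r"rtb-[0-9a-z]+")


def timeline(lines):
    """Group filtered log lines by route table ID as (timestamp, event) pairs."""
    events = defaultdict(list)
    for line in lines:
        stamp, _, message = line.partition(" | ")
        match = RTB_ID.search(message)
        if not match:
            continue  # e.g. "Adding route: 10.0.0.0/8" carries no route table ID
        if "Removing destination" in message:
            kind = "removed"
        elif "Adding destination" in message:
            kind = "added"
        elif "ROUTE !!!" in message:
            kind = "missing"  # route table logged without the 10.0.0.0/8 route
        else:
            kind = "present"  # route table logged with a 10.0.0.0/8 route
        # Assumption: timestamps are day-first (DD/MM/YYYY).
        when = datetime.strptime(stamp.strip(), "%d/%m/%Y %H:%M:%S")
        events[match.group()].append((when, kind))
    return dict(events)
```

Feeding it the lines above gives, per route table, the sequence of present / removed / missing / added events, which makes the gap between the removal and the final add easier to see.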
Since STNO is a hub-and-spoke solution, there will be scenarios where we do not have all the information needed to custom-build routes in spoke accounts. Our most common scenario is that the hub is a landing zone and the spokes are remote tenant accounts. It would therefore be impossible for us to know the remote TGW ID (in this case tgw-00zzzzzzzzzzzzzzz), and we have not implemented a way to point 10.0.0.0/8 to any TGW other than the one created by STNO (in this case tgw-0xxxxxxxxxxxxxxxx).
With the logs above, it's clear that:
I will add more feedback to this comment as I get more time to continue replying to your notes.
Trying to feed some value back into this from our real-life experience. Thx.
Thanks for explaining the scenarios in detail. As you already understand, VPC tags define association and propagation in the TGW route tables. Subnet tags add the subnets to the TGW-VPC attachment and add routes to the VPC route table associated with each tagged subnet. Note: We introduced another tag, "Route-to-tgw", in v3.3 to help customers update route tables associated with the other subnets in the same AZ. The reason is that you can attach only a single subnet per AZ to the TGW-VPC attachment. This tag helps update routes in all the associated route tables.
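As an illustration of the one-subnet-per-AZ constraint, here is a small sketch that splits a list of subnets into the one that can go into the attachment for each AZ and the remaining subnets whose route tables would still need updating. It is not STNO code: `split_by_az` is a hypothetical helper, "first subnet seen wins" is an arbitrary choice for the example, and the dictionaries only assume the `SubnetId` and `AvailabilityZone` keys that EC2 returns for subnets.

```python
from typing import Dict, List, Tuple


def split_by_az(subnets: List[Dict[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Pick one subnet per AZ for the attachment; return the rest separately."""
    attach: Dict[str, str] = {}  # AZ name -> subnet ID used in the TGW-VPC attachment
    extra: List[str] = []        # subnets in an AZ that is already covered
    for subnet in subnets:
        az = subnet["AvailabilityZone"]
        if az not in attach:
            attach[az] = subnet["SubnetId"]  # first subnet seen in this AZ
        else:
            extra.append(subnet["SubnetId"])  # cannot be attached as well
    return attach, extra
```

The subnets that end up in the second list are the ones a tag such as "Route-to-tgw" is meant to cover, since they cannot be added to the attachment themselves.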
Based on the information provided, I plan to add the following items to our backlog and try to ship them in the next planned release.
Can you please confirm whether the items above meet your expectations? Thanks
Thank you, Sir. Yes, certainly.
We took the liberty of adding a bit more detail to the items:
Thanks
Feature request?
We have been using STNO for some time now, and it's awesome, but only now have we detected this behaviour.
We have run STNO v3.1.0 from the start, and our setup uses --parameter-overrides "DefaultRoute" = "Custom-Destinations" with 10.0.0.0/8 as the default route.
We were able to trace the route change back to the function that set this tag: "STNOStatus-RouteTable : Route(s) added to the route table."
CloudWatch logs also show that two route tables were changed:
vpc_handler
The function _update_route_table_with_cidr_blocks, which updates the route table, logs "Adding route", then finds a route and updates it? (_find_existing_default_route + _update_route_table)
The real-life result for us was that our customer had considerable impact/downtime: they had a default route configured towards another TGW, and when STNO finished, inside the action "default_route_crud_operations", we observed that the 10.0.0.0/8 route had changed to point towards our newly configured TGW attachment. This could have been worse to roll back if they had not known which TGW was there before the change.
This might be working as coded/expected, but we are now considering changing the default behaviour to DefaultRoute="Configure-Manually", just to avoid changing routes that may already exist. We would still lose the automated adding of routes where they may be required...
Some thoughts/suggestions come to mind:
Thank you for your thoughts and feedback; any help is very much appreciated.
Thanks and keep up the good work guys. Blue