aws-solutions / network-orchestration-for-aws-transit-gateway

The Network Orchestration for AWS Transit Gateway solution automates the process of setting up and managing transit networks in distributed AWS environments. It creates a web interface to help control, audit, and approve (transit) network changes.
https://aws.amazon.com/solutions/implementations/serverless-transit-network-orchestrator/
Apache License 2.0
124 stars 52 forks source link

STNO Spoke VPC Routing Update #105

Open Cupidazul opened 1 year ago

Cupidazul commented 1 year ago

Feature request?

We have been using STNO for some time now, its awesome, but only now we detected this behaviour.

STNO v3.1.0 since birth, on our setup we had a --parameter-overrides "DefaultRoute" = "Custom-Destinations" for a default-route 10.0.0.0/8.

We were able to backtrack the function tagging the route change. "STNOStatus-RouteTable : Route(s) added to the route table."

Cloudwatch logs also show that two route tables were changed:

2023-11-06 12:06:z1,zzz Adding route: 10.0.0.0/8
2023-11-06 12:06:z1,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-yyyyyyyyyyyy
2023-11-06 12:07:z8,zzz Adding route: 10.0.0.0/8
2023-11-06 12:07:z8,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-tttttttttttttttttt

vpc_handler

    def _update_route_table_with_cidr_blocks(self, existing_routes):
        cidr_blocks = convert_string_to_list_with_no_whitespaces(environ.get("CIDR_BLOCKS"))
        if len(cidr_blocks) > 0:
            for route in cidr_blocks:
                self.logger.debug(f"Adding route: {route}")
                self._find_existing_default_route(existing_routes, route)
                self._update_route_table(route)

...
def _update_route_table(self, route):
...
            if (
                self.event.get("Action") == "AddSubnet"
                or self.event.get("Action") == "CreateTgwVpcAttachment"
            ):
                # create route in spoke account route table
                self._create_route(ec2, route)
            elif (
                self.event.get("Action") == "RemoveSubnet"
                or self.event.get("Action") == "DeleteTgwVpcAttachment"
            ):
                # delete route from spoke account route table
                self._delete_route(ec2, route)
...
def _create_route(self, destination):
...
                else:
                    self.spoke_ec2_client.create_route_cidr_block(
                        destination,
                        self.event.get("RouteTableId"),
                        environ.get("TGW_ID"),
                    )
                self._create_tag(
                    self.event.get("RouteTableId"),
                    "RouteTable",
                    "Route(s) added to the route table.",
                )
...
ec2.py

    def create_route_cidr_block(
            self,
            vpc_cidr: str,
            route_table_id: str,
            transit_gateway_id: str
    ) -> CreateRouteResultTypeDef:
        response = self.ec2_client.create_route(
            DestinationCidrBlock=vpc_cidr,
            RouteTableId=route_table_id,
            TransitGatewayId=transit_gateway_id,
        )

The function _update_route_table_with_cidr_blocks to Update the Routing Table Logs "Adding route", then finds a route and updates it? ( _find_existing_default_route + _update_route_table)

The real life result to us was that, our customer had a considerable impact/downtime because he had a default-route configured towards another TGW, and when STNO finished, inside the Action "default_route_crud_operations", we observed that 10.0.0.0/8 route changed to point towards the our newly configured tgw-attachment. This could be worse to roll-back if he didn't know what TGW was there before the change.

This might be doing its thing as coded/expected, but now we are considering changing default behaviour to DefaultRoute="Configure-Manually", just to avoid changing current routes that may exist, still we will loose the automated adding routes where they may be required...

Some thoughts/suggestions come to mind:

  1. In the _update_route_table_with_cidr_blocks, shouldn't the log state "Updating route" instead of "Adding route"? if the function proceeds into a find + update.
  2. There should be some protection/validation before updating an existing route.
  3. Its risky to change routes thru an Automated procedure, may need a bit more controls on these changes, routing rules may differ from customer to customer.
  4. Is the create-route ec2.py updating the route, instead of simply creating it, or is this done somewhere else? RollBack the route to previous TGW should exist if STNO process fails.
  5. Just saw that there is a _delete_route inside the fn _update_route_table, but there was no log informing the deletion of 10.0.0.0/8. And found in the logs more info showing STNO was messing up the VPC routing tables for a while, ever since we did add, delete then add VPC TAGs again. The delete TAGs removed the route which is awful.
  6. fn _find_existing_default_route logged no findings.
  7. fn _update_route_table logged no findings.
  8. In fact, it was observed that no route to 10.0.0.0/8 existed in 3x routing tables on tenants side, but no downtime was detected until a successfull STNO, where a new route was added to each RT on Monday 12h00: "Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxx into the route table:". No 10.0.0.0/8 existed during the weekend, and no issue was reported, only at 14h00-Monday after adding wrong TGW to routing tables.

We thank you for your thoughts, feed-back or anythings onto helping us is appreciated very much.

Thanks and keep up the good work guys. Blue

groverlalit commented 1 year ago

Good morning, Thanks for providing us the feedback and sharing details of this event.

It seems the user tagged a VPC that already had a route to a different transit gateway (from the one defined in the STNO solution stack).

By design, the solution creates route(s) in the VPC route table using the customer defined (CFN Params) destination and target. If the destination (example, 0/0, RFC-1918, Custom destinations) already exist in the route table with another target, STNO does not add the route or update the route with new destination to avoid impact to the customer network.

The supported use cases does not access all the existing routes and evaluate of all the possible scenarios. It would great if you can open a feature request to help us learn about new use cases.

As per the ticket, if we expect users to tag VPCs with existing routes to different TGWs, then using Configure-Manually option is the best option. IMO, if majority of the VPCs are new that needs to be attached with TGW, I would recommend to skip tagging the VPC with existing routes to unmanaged TGWs. Second option could be the use of "Conditional" tags for TGW route tables. For example, if the VPCs that have existing routes to different TGW are in a different OU in the AWS Organization, you can use the complaince feature to auto-reject attachments and avoid impact to the network. (2, 3)

For 5, deletion of the routes added by the STNO workflow once you remove the tag is by design. The _delete_route function should add (this message)[https://github.com/aws-solutions/network-orchestration-for-aws-transit-gateway/blob/main/source/lambda/tgw_vpc_attachment/lib/handlers/vpc_handler.py#L270-L274] in the logs. Please feel free to share how we can improve this message.

For 6, 7 We have added backlog items to add specific log messages in the log group.

For 8, we are thinking about how we can help STNO user detection issues sooner. Any input/guidance based on your experience with STNO would be great.

If you prefer to share more details via a support ticket/TAM/SA, can you please refer the GitHub issue id to help us connect the two tickets.

Cupidazul commented 1 year ago

Good Morning Sir.,

And thank you for your comprehensive reply.

"By design, the solution creates route(s) in the VPC route table using the customer defined (CFN Params) destination and target. If the destination (example, 0/0, RFC-1918, Custom destinations) already exist in the route table with another target, STNO does not add the route or update the route with new destination to avoid impact to the customer network."

This was exactly what we saw, probably our STNO setup did delete the route 10.0.0.0/8 at some point, as we can see in the logs (but without impact to the customer), then when STNO process succeeded at 6/11 we saw the 10.0.0.0/8 being added. Before 6/11 we had issues with STNO Tagging (issue was: we were only tagging the VPCs and approving while the SubNets were not yet tagged).

Our setup is: "DefaultRoute" = "Custom-Destinations" CIDR_BLOCKS = "10.0.0.0/8"

I was able to filter the logs to give us a better view of this:

02/11/2023 16:18:51 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:18:54 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:19:02 | RouteTableId': 'rtb-00ccccccccccccccc' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:24:59 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:24:59 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00bbbbbbbbbbbbbbb
02/11/2023 16:25:03 | RouteTableId': 'rtb-00ccccccccccccccc' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:25:03 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00ccccccccccccccc
02/11/2023 16:25:04 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> 'DestinationCidrBlock': '10.0.0.0/8', 'TransitGatewayId': 'tgw-00zzzzzzzzzzzzzzz'
02/11/2023 16:25:04 | Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx from the route table: rtb-00aaaaaaaaaaaaaaa
02/11/2023 16:32:46 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:32:46 | Adding route: 10.0.0.0/8
02/11/2023 16:32:49 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:32:49 | Adding route: 10.0.0.0/8
02/11/2023 16:56:31 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:56:31 | Adding route: 10.0.0.0/8
02/11/2023 16:58:41 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:41 | Adding route: 10.0.0.0/8
02/11/2023 16:58:48 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:48 | Adding route: 10.0.0.0/8
02/11/2023 16:58:52 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 16:58:52 | Adding route: 10.0.0.0/8
02/11/2023 17:33:26 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:26 | Adding route: 10.0.0.0/8
02/11/2023 17:33:28 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:28 | Adding route: 10.0.0.0/8
02/11/2023 17:33:31 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
02/11/2023 17:33:31 | Adding route: 10.0.0.0/8
06/11/2023 12:03:52 | RouteTableId': 'rtb-00aaaaaaaaaaaaaaa' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:03:52 | Adding route: 10.0.0.0/8
06/11/2023 12:03:52 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00aaaaaaaaaaaaaaa
06/11/2023 12:06:02 | RouteTableId': 'rtb-00ccccccccccccccc' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:06:02 | Adding route: 10.0.0.0/8
06/11/2023 12:06:02 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00ccccccccccccccc
06/11/2023 12:07:38 | RouteTableId': 'rtb-00bbbbbbbbbbbbbbb' >> NO 10.0.0.0/8 ROUTE !!!
06/11/2023 12:07:38 | Adding route: 10.0.0.0/8
06/11/2023 12:07:38 | Adding destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx into the route table: rtb-00bbbbbbbbbbbbbbb

Note: 'NO 10.0.0.0/8 ROUTE !!!' means that full VPC information exists in the log with the full routing table but without 10.0.0.0/8 route.

Since STNO is a Hub / Spoke solution, this means that a scenario will occur where we don't have all information to custom build routes on Spoke accounts. Our most common scenario is that the Hub is a LandingZone and the Spokes are remote tenant accounts. Therefore it would be impossible for us to know the remote TGWid (in this case tgw-00zzzzzzzzzzzzzzz), nor do we have implemented a way to point the 10.0.0.0/8 to other TGW other than the currently created by STNO (in this case tgw-0xxxxxxxxxxxxxxxx).

With the logs above, its clear that:

  1. The Route was deleted when the log says "Removing destination : 10.0.0.0/8 to TGW gateway: tgw-0xxxxxxxxxxxxxxxx", but actually the tgw was another "tgw-00zzzzzzzzzzzzzzz". Therefore its safe to say that this "Removing destination" is not accurate to the TGW it says, and it should never have deleted this route ( I do know that we were adding and removing TAGs from VPCs / Subnets, and its expected that the routes would be deleted, but not towards another destination other than through the attachment currently created. ). At least a rollback procedure should there to undo the routes added.
  2. Customers will not have visibility as the TAGs on their side dont show whats being added or removed. Into the tags there could be more information related to the route CIDR and TGW. “Route(s) added to the route table” “Route(s) deleted to the route table”

I will feedback more into this comment as I get some more time to continue replying to your notes.

Trying to feedback some value into this with our real life experience. Thx.

groverlalit commented 1 year ago

Thanks for explaining the scenarios with the details. As you already understand it well that VPC tags are to define association and propagation in the TGW route tables. The Subnet tags helps with appending the subnets in TGW-VPC attachment and adding routes to the VPC route table associated with the tagged subnet. Note: We introduced another tag "Route-to-tgw" in v3.3 to help customer update route tables associated with the other subnets in the same AZ. The reason is that you can only to append a single subnet per AZ in the TGW-VPC attachment. This helps updating routes in all the associated route tables.

Based on the information provided I plan add following items in our backlog and try ship in the next planned release.

  1. STNO must ONLY delete an existing route if the destination is the tagged subnet (which is obvious in this case) AND target must be TGW ID defined in the CFN parameter. This would avoid changing routes not managed by the solution.
  2. Add more metadata in the logs to easily identify the specific changes in the route table.

Can you please confirm if the items above meets the your expectation? Thanks

Cupidazul commented 1 year ago

Thank you Sir., Yes, certainly.

We took the liberty of just to adding a bit more detail into the items:

  1. If "Route-to-tgw" is not defined on SubNets, then, ANY routes should be added/deleted on Spoke account VPC route tables. If TAG "Route-to-tgw" is defined on any SubNet, then, that TAG must contains the value of the destination TGWid ( ex: "Route-to-tgw" := "tgw-00zzzzzzzzzzzzzzz" ), then a route is added ONLY to the routing table associated to the SubNets where the tag is present pointing to target TGWid, any routes deletion should ONLY occur if the target matches this TGWid tag value.
  2. Add more metadata in the logs _ ( _createtag, logger.debug, logger.info, logger.ANY ) to easily identify the specific changes in the route table. Including CIDR's added/deleted and object added/deleted ( i.e. TGWid )

Thanks