aws-solutions / network-orchestration-for-aws-transit-gateway

The Network Orchestration for AWS Transit Gateway solution automates the process of setting up and managing transit networks in distributed AWS environments. It creates a web interface to help control, audit, and approve (transit) network changes.
https://aws.amazon.com/solutions/implementations/serverless-transit-network-orchestrator/
Apache License 2.0

EnableTransitGatewayRouteTablePropagation fails due to TGW in invalid state #116

Closed ckamps closed 4 weeks ago

ckamps commented 2 months ago

Describe the bug

When using CloudFormation to create a VPC with three subnets that include the tag with key Attach-to-tgw, the Network Orchestration automation does not consistently create a propagation for the TGW attachment. Sometimes the expected propagation is created; in other cases it is not.

Unlike the propagation, the association appears to be consistently created for the other TGW route table.

The symptom is similar to https://github.com/aws-solutions/network-orchestration-for-aws-transit-gateway/issues/1, but that issue applied to associations and was apparently fixed in 3.0.0.

When the propagation is not created, I see the following error:

"An error occurred (IncorrectState) when calling the EnableTransitGatewayRouteTablePropagation operation: tgw-attach-014... is in invalid state"

In the Lambda log:

...
{
    "level": "INFO",
    "location": "enable_transit_gateway_route_table_propagation:634",
    "message": "Enabling RT: tgw-rtb-09b... Propagation To Tgw Attachment",
    "timestamp": "2024-07-12 23:23:42,085+0000",
    "service": "TransitGatewayVPCAttachments",
    "xray_trace_id": "1-..."
}
{
    "level": "ERROR",
    "location": "wrapper_func:25",
    "message": "An error occurred (IncorrectState) when calling the EnableTransitGatewayRouteTablePropagation operation: tgw-attach-014... is in invalid state",
    "timestamp": "2024-07-12 23:23:42,196+0000",
    "service": "exception_handler",
    "xray_trace_id": "1-..."
}
...

It's misleading that the VPC's tags include the VPCPropagation message shown below, given that the propagation wasn't successful.

STNOStatus-VPCAssociation: 2024-07-12T23:23:41Z: VPC has been associated with the Transit Gateway Routing Table/Domain
STNOStatus-VPCPropagation: 2024-07-12T23:23:42Z: VPC RT propagation has been enabled to the Transit Gateway Routing Table/Domain
STNOStatus-VPCAttachment: 2024-07-12T23:20:07Z: VPC has been attached to the Transit Gateway
Associate-with: ex-workload-vpcs
Propagate-to:   ex-egress-vpc

To Reproduce

Since the error occurs intermittently, I have not yet determined how to consistently cause the error to occur. Typically, 1 out of every 3 or so attempts to create my stack including its VPCs and subnets will encounter this issue.

Create a stack using a CloudFormation template that creates three subnets in succession including the addition of the tag with key Attach-to-tgw.

In my environment, I have:

Expected behavior

  1. Consistent creation of propagation in the specified TGW route table.
  2. An accurate message in the VPC tag STNOStatus-VPCPropagation when propagation is not successful.
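
A minimal sketch of the second expectation, assuming the propagation call raises on failure. The function name record_propagation_status, the enable_propagation callable, and the in-memory tags dict are hypothetical stand-ins for illustration and are not part of the solution's code; only the tag key and the success message are taken from the tags shown above.

```python
from datetime import datetime, timezone
from typing import Callable, Dict

STATUS_TAG = "STNOStatus-VPCPropagation"
MAX_TAG_VALUE = 256  # AWS limits tag values to 256 characters


def record_propagation_status(
    enable_propagation: Callable[[], None],  # hypothetical: performs the API call
    tags: Dict[str, str],                    # hypothetical: stands in for the VPC's tags
) -> bool:
    """Write the status tag only after the outcome of the call is known."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    try:
        enable_propagation()
    except Exception as err:
        # Record the failure instead of claiming the propagation succeeded.
        message = f"{stamp}: VPC RT propagation failed: {err}"
        tags[STATUS_TAG] = message[:MAX_TAG_VALUE]
        return False
    tags[STATUS_TAG] = (
        f"{stamp}: VPC RT propagation has been enabled "
        "to the Transit Gateway Routing Table/Domain"
    )
    return True
```

With this ordering, the "has been enabled" message can only appear after a call that returned without error.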


gsingh04 commented 2 months ago

Thanks @ckamps for the details on the bug and the steps to reproduce it. Please allow us to look into it, and we will get back to you on this issue.

ckamps commented 1 month ago

It looks like the following change addresses the problem I was encountering. Since the state machine step Enable TGW Attachment Propagations already handles a ResourceBusyException exception, this step will be automatically retried when this exception is thrown.

diff --git a/source/lambda/tgw_vpc_attachment/lib/handlers/tgw_vpc_attachment_handler.py b/source/lambda/tgw_vpc_attachment/lib/handlers/tgw_vpc_attachment_handler.py
index dfec102..9813056 100644
--- a/source/lambda/tgw_vpc_attachment/lib/handlers/tgw_vpc_attachment_handler.py
+++ b/source/lambda/tgw_vpc_attachment/lib/handlers/tgw_vpc_attachment_handler.py
@@ -632,9 +632,12 @@ class TransitGatewayVPCAttachments:
             # if the return list is empty the API to enable tgw rt propagation will be skipped.
             for tgw_route_table_id in propagation_route_tables:
                 self.logger.info(f"Enabling RT: {tgw_route_table_id} Propagation To Tgw Attachment")
-                self.hub_ec2_client.enable_transit_gateway_route_table_propagation(
+                response = self.hub_ec2_client.enable_transit_gateway_route_table_propagation(
                     tgw_route_table_id,
                     self.event.get("TransitGatewayAttachmentId"))
+
+                if response.get("Error") == "IncorrectState":
+                    raise ResourceBusyException

                 self._create_tag(
                     self.event.get("VpcId"),

This change is similar to what was already implemented to force a retry when calling _add_subnet_to_tgw_attachment(self) and _remove_subnet_from_tgw_attachment(self) and encountering an IncorrectState response:

        response = self.spoke_ec2_client.remove_subnet_from_tgw_attachment(
            self.event.get("TransitGatewayAttachmentId"),
            self.event.get('SubnetId'),
        )
        if response.get("Error") == "IncorrectState":
            raise ResourceBusyException
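
The check on response.get("Error") only helps if the client wrapper returns a dict with an Error key rather than raising. If the wrapper raises instead (an assumption; its code is not shown here), the same retry signal could be produced by translating the error code, as in the sketch below. The helper enable_propagation_or_signal_busy and the FakeClientError stand-in are hypothetical; the stand-in only mimics the response["Error"]["Code"] shape that botocore's ClientError exposes, so the example runs without AWS access.

```python
class ResourceBusyException(Exception):
    """Stand-in for the exception the state machine step retries on."""


class FakeClientError(Exception):
    """Hypothetical stand-in with the same .response shape as botocore's ClientError."""

    def __init__(self, code, message):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


def enable_propagation_or_signal_busy(enable_call, route_table_id, attachment_id):
    """Call enable_call; turn an IncorrectState error into ResourceBusyException."""
    try:
        return enable_call(route_table_id, attachment_id)
    except Exception as err:
        # Exceptions without a .response attribute fall through and are re-raised.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code == "IncorrectState":
            raise ResourceBusyException(str(err)) from err
        raise
```

Any other error code is re-raised unchanged, so only the transient attachment state triggers the retry path.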

gsingh04 commented 1 month ago

Thank you @ckamps for the details on the fix. We are testing the fix you provided in our environment and plan to include it in an upcoming release.

gsingh04 commented 1 month ago

@ckamps, would it be possible for you to try out the changes in the referenced PR in your environment and confirm whether they resolve the issue for you?