aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

AWS::EC2::TransitGatewayRouteTableAssociation replacement fails #488

Closed guidola closed 5 months ago

guidola commented 4 years ago

2. Scope of request

AWS::EC2::TransitGatewayRouteTableAssociation fails on an UPDATE operation when the existing association must be replaced by a newly defined or modified resource, i.e. when manually changing the route table that a transit gateway attachment is associated with.

[Screenshot: cfn-tgw-rt-attach-fail_LI (2)]

3. Expected behavior

The existing TransitGatewayRouteTableAssociation should be removed and replaced by its new definition.

4. Suggest specific test cases

5. Helpful Links to speed up research and evaluation

6. Category (required) - Will help with tagging and be easier to find by other users to +1

Networking & Content (VPC, Route53, API GW,...)

7. Any additional context (optional)

yakireliyahu1987 commented 2 years ago

TL;DR - a workaround for updating a Transit Gateway Route Table Association with CloudFormation, using a Custom Resource

For anyone that stumbles into this issue, I've managed to create a workaround for it and I thought that it might help others.

Before we dive into the workaround, it's important to understand how CloudFormation works and why the issue happens at all. According to the AWS::EC2::TransitGatewayRouteTableAssociation documentation, updating any of the resource's properties requires a replacement of the resource.

In CloudFormation, when a resource needs to be replaced, the new resource is created first, and the old resource is deleted only at the end of the stack update (UPDATE_COMPLETE_CLEANUP_IN_PROGRESS). Because a Transit Gateway attachment can be associated with only one route table, the EC2 service rejects the creation of the new association while the old one still exists (as shown in @guidola's post).
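
To make the ordering problem concrete, here is a minimal sketch in plain Python. It is a toy model, not AWS code: `ToyTransitGateway`, `AlreadyAssociatedError`, and the two replacement functions are hypothetical names invented for illustration, and the only rule modelled is that an attachment can hold one association at a time.

```python
class AlreadyAssociatedError(Exception):
    """Raised when an attachment already has a route table association."""


class ToyTransitGateway:
    """Toy model: each attachment may be associated with at most one route table."""

    def __init__(self):
        self.associations = {}  # attachment_id -> route_table_id

    def associate(self, attachment_id, route_table_id):
        if attachment_id in self.associations:
            raise AlreadyAssociatedError(
                f"{attachment_id} is already associated with "
                f"{self.associations[attachment_id]}"
            )
        self.associations[attachment_id] = route_table_id

    def disassociate(self, attachment_id):
        self.associations.pop(attachment_id, None)


def create_then_delete(tgw, attachment_id, new_route_table_id):
    """Create the new association first; the old one would be cleaned up later."""
    # This call fails while the old association still exists,
    # so the cleanup step is never reached.
    tgw.associate(attachment_id, new_route_table_id)


def delete_then_create(tgw, attachment_id, new_route_table_id):
    """Remove the old association first, then create the new one."""
    tgw.disassociate(attachment_id)
    tgw.associate(attachment_id, new_route_table_id)


tgw = ToyTransitGateway()
tgw.associate("tgw-attach-1", "tgw-rtb-old")

try:
    create_then_delete(tgw, "tgw-attach-1", "tgw-rtb-new")
    create_first_error = None
except AlreadyAssociatedError as err:
    create_first_error = str(err)

delete_then_create(tgw, "tgw-attach-1", "tgw-rtb-new")
print(create_first_error)
print(tgw.associations)
```

In this model the create-first order raises as soon as the second association is attempted, while the delete-first order succeeds, which mirrors the failure described above.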

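To overcome this issue, a "destroy-then-create" operation has to be performed on the resource, which is not supported natively by CloudFormation. The workaround performs this operation by invoking a Custom Resource (a Lambda function with a Python runtime) which, on every create or update, verifies that the target route table exists, looks up the attachment's current association and, if the attachment is associated with a different route table, disassociates it and waits for the disassociation to complete.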

To avoid errors during the cleanup process, the UpdateReplacePolicy and DeletionPolicy of the TGW association resource are set to Retain, so CloudFormation will not attempt to delete the "old" association. In addition, to ensure the proper order of execution (custom resource -> TGW route table association), the AWS::EC2::TransitGatewayRouteTableAssociation resource declares a dependency on the Custom Resource.

The CloudFormation snippet below implements the above workaround. It has been tested on stack creation and update, but you should test it yourself before applying it in production.

Note that there are some placeholders in the template - replace them with your own resources' references

---
Resources:
  DeleteTGWAssociationWhenTableIDChangesRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
        Version: "2012-10-17"
      ManagedPolicyArns:
        - Fn::Join:
            - ""
            - - "arn:"
              - Ref: AWS::Partition
              - ":iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
      Policies:
        - PolicyDocument:
            Statement:
              - Action:
                  - ec2:DescribeTransitGatewayAttachments
                  - ec2:DescribeTransitGatewayRouteTables
                  - ec2:DisassociateTransitGatewayRouteTable
                Effect: Allow
                Resource: "*"
            Version: "2012-10-17"
          PolicyName: AllowDisassociateTransitGatewayRouteTable
  DeleteTGWAssociationWhenTableIDChangesFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile:
          "import boto3\nimport urllib3\nimport json\nimport time\ntry:\n    import
          botostubs\nexcept ImportError:\n    pass\n\ndef cfn_send(event, context, responseStatus,
          responseData, physicalResourceId=None, noEcho=False, reason=None):\n    http
          = urllib3.PoolManager()\n    responseUrl = event['ResponseURL']\n\n    print(responseUrl)\n\n
          \   responseBody = {}\n    responseBody['Status'] = responseStatus\n    responseBody['Reason']
          = reason if reason else 'See the details in CloudWatch Log Stream: ' + context.log_stream_name\n
          \   responseBody['PhysicalResourceId'] = physicalResourceId or context.log_stream_name\n
          \   responseBody['StackId'] = event['StackId']\n    responseBody['RequestId']
          = event['RequestId']\n    responseBody['LogicalResourceId'] = event['LogicalResourceId']\n
          \   responseBody['NoEcho'] = noEcho\n    responseBody['Data'] = responseData\n\n
          \   json_responseBody = json.dumps(responseBody)\n\n    print(\"Response
          body:\\n\" + json_responseBody)\n\n    headers = {\n        'content-type'
          : '',\n        'content-length' : str(len(json_responseBody))\n    }\n\n
          \   try:\n        \n        response = http.request('PUT',responseUrl,body=json_responseBody.encode('utf-8'),headers=headers)\n
          \       print(\"Status code: \" + response.reason)\n    except Exception
          as e:\n        print(\"send(..) failed executing requests.put(..): \" +
          str(e))\n\ndef lambda_handler(event, context):\n    print(event)\n    event_props
          = event.get('ResourceProperties', {})\n\n    try:\n        client = boto3.client(\"ec2\")
          #type: botostubs.EC2\n        tgw_route_table_id = event_props[\"tgw_route_table_id\"]\n
          \       tgw_attachment_id = event_props[\"tgw_attachment_id\"]\n\n        if
          event[\"RequestType\"] in [\"Create\",\"Update\"]:\n            if not client.describe_transit_gateway_route_tables(TransitGatewayRouteTableIds=[tgw_route_table_id])[\"TransitGatewayRouteTables\"]:\n
          \               raise Exception(f\"Transit Gateway Route Table ID {tgw_route_table_id}
          does not exist or cannot be found!\")\n            \n            association
          = client.describe_transit_gateway_attachments(\n                TransitGatewayAttachmentIds=[tgw_attachment_id]\n
          \           )[\"TransitGatewayAttachments\"][0].get(\"Association\")\n\n
          \           if association and association[\"TransitGatewayRouteTableId\"]
          != tgw_route_table_id:\n                response = client.disassociate_transit_gateway_route_table(\n
          \                   TransitGatewayRouteTableId=association[\"TransitGatewayRouteTableId\"],\n
          \                   TransitGatewayAttachmentId=tgw_attachment_id\n                )\n
          \               # Wait for attachment to disassociate\n                while
          client.describe_transit_gateway_attachments(TransitGatewayAttachmentIds=[tgw_attachment_id])[\"TransitGatewayAttachments\"][0].get(\"Association\"):\n
          \                   time.sleep(2)\n\n\n        \n        return cfn_send(event, context,
          responseStatus=\"SUCCESS\",responseData=None, physicalResourceId=None)\n
          \   except Exception as err:\n        print(str(err))\n        return cfn_send(event,
          context, responseStatus=\"FAILED\",responseData=None, reason=str(err))\n\n\n\n\n\n"
      Role:
        Fn::GetAtt:
          - DeleteTGWAssociationWhenTableIDChangesRole
          - Arn
      Description:
        This function provides a workaround for changing a Transit Gateway Attachment's
        route table association, due to CloudFormation limitations
      FunctionName: tgw-route-table-disassociate-helper
      Handler: index.lambda_handler
      MemorySize: 128
      Runtime: python3.8
      Timeout: 10
    DependsOn:
      - DeleteTGWAssociationWhenTableIDChangesRole
  DeleteTGWAssociationWhenTableIDChanges:
    Type: AWS::CloudFormation::CustomResource
    Properties:
      ServiceToken:
        Fn::GetAtt:
          - DeleteTGWAssociationWhenTableIDChangesFunction
          - Arn
      tgw_route_table_id: 
        Ref: <TGWRouteTable resource>
      tgw_attachment_id:
        Ref: <TGWAttachment resource>
    UpdateReplacePolicy: Delete
    DeletionPolicy: Delete
  TGWRouteTableAssociation:
    Type: AWS::EC2::TransitGatewayRouteTableAssociation
    Properties:
      TransitGatewayAttachmentId:
        Ref: <TGWAttachment resource>
      TransitGatewayRouteTableId: 
        Ref: <TGWRouteTable resource>
    DependsOn:
      - DeleteTGWAssociationWhenTableIDChanges
    UpdateReplacePolicy: Retain
    DeletionPolicy: Retain

Here's the Lambda function code in a more readable form. I've taken the send() function from the cfnresponse Python module and embedded it in the function, because I wanted to see the actual error (if any) in CloudFormation instead of searching the CloudWatch logs.

import boto3
import urllib3
import json
import time
try:
    import botostubs  # optional; only used by the type hint comment in lambda_handler
except ImportError:
    pass

def cfn_send(event, context, responseStatus, responseData, physicalResourceId=None, noEcho=False, reason=None):
    http = urllib3.PoolManager()
    responseUrl = event['ResponseURL']

    print(responseUrl)

    responseBody = {}
    responseBody['Status'] = responseStatus
    responseBody['Reason'] = reason if reason else 'See the details in CloudWatch Log Stream: ' + context.log_stream_name
    responseBody['PhysicalResourceId'] = physicalResourceId or context.log_stream_name
    responseBody['StackId'] = event['StackId']
    responseBody['RequestId'] = event['RequestId']
    responseBody['LogicalResourceId'] = event['LogicalResourceId']
    responseBody['NoEcho'] = noEcho
    responseBody['Data'] = responseData

    json_responseBody = json.dumps(responseBody)

    print("Response body:\n" + json_responseBody)

    headers = {
        'content-type' : '',
        'content-length' : str(len(json_responseBody))
    }

    try:

        response = http.request('PUT',responseUrl,body=json_responseBody.encode('utf-8'),headers=headers)
        print("Status code: " + response.reason)
    except Exception as e:
        print("send(..) failed executing requests.put(..): " + str(e))

def lambda_handler(event, context):
    print(event)
    event_props = event.get('ResourceProperties', {})

    try:
        client = boto3.client("ec2") #type: botostubs.EC2
        tgw_route_table_id = event_props["tgw_route_table_id"]
        tgw_attachment_id = event_props["tgw_attachment_id"]

        if event["RequestType"] in ["Create","Update"]:
            if not client.describe_transit_gateway_route_tables(TransitGatewayRouteTableIds=[tgw_route_table_id])["TransitGatewayRouteTables"]:
                raise Exception(f"Transit Gateway Route Table ID {tgw_route_table_id} does not exist or cannot be found!")

            association = client.describe_transit_gateway_attachments(
                TransitGatewayAttachmentIds=[tgw_attachment_id]
            )["TransitGatewayAttachments"][0].get("Association")

            if association and association["TransitGatewayRouteTableId"] != tgw_route_table_id:
                response = client.disassociate_transit_gateway_route_table(
                    TransitGatewayRouteTableId=association["TransitGatewayRouteTableId"],
                    TransitGatewayAttachmentId=tgw_attachment_id
                )
                # Wait for attachment to disassociate
                while client.describe_transit_gateway_attachments(TransitGatewayAttachmentIds=[tgw_attachment_id])["TransitGatewayAttachments"][0].get("Association"):
                    time.sleep(2)  # pause between polls instead of busy-waiting

        return cfn_send(event, context, responseStatus="SUCCESS",responseData=None, physicalResourceId=None)
    except Exception as err:
        print(str(err))
        return cfn_send(event, context, responseStatus="FAILED",responseData=None, reason=str(err))
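
The wait loop in the handler above polls until the association disappears, with no upper bound of its own. As a rough sketch of how that wait could be bounded, the helper below polls a caller-supplied function and gives up after a deadline. `wait_until_disassociated` and its parameters are hypothetical names that are not part of the original function, and the default timeout and interval are arbitrary assumptions.

```python
import time


def wait_until_disassociated(get_association, timeout_s=60.0, poll_s=2.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_association() until it returns a falsy value or timeout_s elapses.

    get_association is any zero-argument callable that returns the current
    association (truthy) or None once the attachment has no association.
    Returns True if the association disappeared, False on timeout.
    """
    deadline = clock() + timeout_s
    while get_association():
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


# Offline demonstration with a fake lookup: two polls still show an
# association, the third shows none.
fake_states = iter([{"State": "disassociating"}, {"State": "disassociating"}, None])
finished = wait_until_disassociated(lambda: next(fake_states), sleep=lambda s: None)
print(finished)
```

Inside the handler, `get_association` could wrap the same `describe_transit_gateway_attachments` lookup that the loop already uses, and the handler could report FAILED when the helper returns False. The timeout would need to stay below the function's configured Timeout, for example by deriving it from `context.get_remaining_time_in_millis()`.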

I hope that someone finds it useful :)

thu001 commented 6 months ago

The issue has been resolved. The new resource schema now enforces delete_then_create for updates that require replacement.
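
For anyone who wants to check this on their own account, here is a small sketch. It assumes the registered resource type schema exposes the replacement order in a top-level `replacementStrategy` field, as the resource provider schema format describes, and that the field defaults to `create_then_delete` when absent. `replacement_strategy` and `fetch_schema` are hypothetical helper names, and the sample schema is a trimmed, made-up document rather than the real one.

```python
import json


def replacement_strategy(schema_json):
    """Return the replacement order declared in a resource type schema document.

    Falls back to "create_then_delete" when the field is absent
    (assumed to be the schema default).
    """
    schema = json.loads(schema_json)
    return schema.get("replacementStrategy", "create_then_delete")


def fetch_schema(type_name):
    """Fetch the registered schema via boto3 (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helper also works offline

    cfn = boto3.client("cloudformation")
    return cfn.describe_type(Type="RESOURCE", TypeName=type_name)["Schema"]


# Offline example with a trimmed, made-up schema document:
sample = json.dumps({
    "typeName": "AWS::EC2::TransitGatewayRouteTableAssociation",
    "replacementStrategy": "delete_then_create",
})
print(replacement_strategy(sample))
```

With credentials configured, `replacement_strategy(fetch_schema("AWS::EC2::TransitGatewayRouteTableAssociation"))` would show what the live schema declares.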

thu001 commented 5 months ago

This issue has been fixed. The resource now supports update by delete_then_create. Below is a test stack that successfully performed a resource update. [image]

ammokhov commented 5 months ago

Closing the issue as the fix has been pushed