aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws-ec2): NAT Gateway error while CIDR replacement in VPC #16869

Open tai-acall opened 2 years ago

tai-acall commented 2 years ago

What is the problem?

The NAT Gateway fails to be created when the VPC CIDR is changed, causing the CloudFormation stack update to fail.

Failed resources: UPDATE_FAILED | AWS::EC2::NatGateway | vpc/publicSubnet1/NATGateway (vpcpublicSubnet1NATGateway) NatGateway nat-xxxxxxxx is in state failed and hence failed to stabilize. Detailed failure message: Network vpc-xxxxxx has no Internet gateway attached

Reproduction Steps

Deploy the following stack, then change the CIDR to something else, such as 10.20.0.0/16, and deploy again to trigger the issue.

    const cidr = '10.10.0.0/16';
    this.vpc = new ec2.Vpc(this, 'vpc', {
      cidr: cidr,
      enableDnsHostnames: true,
      enableDnsSupport: true,
      maxAzs: 2,
      natGateways: 1,
      flowLogs: {
        FlowLog: {
          destination: ec2.FlowLogDestination.toS3(s3Bucket),
        },
      },
      subnetConfiguration: [
        {
          name: 'public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_NAT,
          cidrMask: 24,
        },
      ],
    });

What did you expect to happen?

This should create a new NAT Gateway and switch from the old resource to the new one.

What actually happened?

The deployment failed with the following error message:

Failed resources: UPDATE_FAILED | AWS::EC2::NatGateway | vpc/publicSubnet1/NATGateway (vpcpublicSubnet1NATGateway) NatGateway nat-xxxxxxxx is in state failed and hence failed to stabilize. Detailed failure message: Network vpc-xxxxxx has no Internet gateway attached

CDK CLI Version

1.53.0 (build 6c326cb)

Framework Version

No response

Node.js Version

v15.13.0

OS

MacOS 10.14.6

Language

TypeScript

Language Version

No response

Other information

No response

musabgelisgen commented 2 years ago

Exact same issue observed on CDK CLI Version 1.86.0

njlynch commented 2 years ago

Confirmed at the latest CDK version.

The diff before the deploy shows the following resources will be replaced:

% cdk diff 2>&1 | grep 'AWS::'
[~] AWS::EC2::VPC vpc vpcA2121C38 replace
[~] AWS::EC2::Subnet vpc/publicSubnet1/Subnet vpcpublicSubnet1SubnetA635257E replace
[~] AWS::EC2::RouteTable vpc/publicSubnet1/RouteTable vpcpublicSubnet1RouteTableA38152FE replace
[~] AWS::EC2::SubnetRouteTableAssociation vpc/publicSubnet1/RouteTableAssociation vpcpublicSubnet1RouteTableAssociationB46101B8 replace
[~] AWS::EC2::Route vpc/publicSubnet1/DefaultRoute vpcpublicSubnet1DefaultRouteF0973989 replace
[~] AWS::EC2::NatGateway vpc/publicSubnet1/NATGateway vpcpublicSubnet1NATGateway974E731F replace
[~] AWS::EC2::Subnet vpc/publicSubnet2/Subnet vpcpublicSubnet2Subnet027D165B replace
[~] AWS::EC2::RouteTable vpc/publicSubnet2/RouteTable vpcpublicSubnet2RouteTableA6135437 replace
[~] AWS::EC2::SubnetRouteTableAssociation vpc/publicSubnet2/RouteTableAssociation vpcpublicSubnet2RouteTableAssociation73F6478A replace
[~] AWS::EC2::Route vpc/publicSubnet2/DefaultRoute vpcpublicSubnet2DefaultRoute13685A07 replace
[~] AWS::EC2::Subnet vpc/privateSubnet1/Subnet vpcprivateSubnet1SubnetAE1393DC replace
[~] AWS::EC2::RouteTable vpc/privateSubnet1/RouteTable vpcprivateSubnet1RouteTableC1CE9D76 replace
[~] AWS::EC2::SubnetRouteTableAssociation vpc/privateSubnet1/RouteTableAssociation vpcprivateSubnet1RouteTableAssociationD9FC1FAE replace
[~] AWS::EC2::Route vpc/privateSubnet1/DefaultRoute vpcprivateSubnet1DefaultRoute22F06BF9 replace
[~] AWS::EC2::Subnet vpc/privateSubnet2/Subnet vpcprivateSubnet2Subnet1C8B0CEE replace
[~] AWS::EC2::RouteTable vpc/privateSubnet2/RouteTable vpcprivateSubnet2RouteTable882A110C replace
[~] AWS::EC2::SubnetRouteTableAssociation vpc/privateSubnet2/RouteTableAssociation vpcprivateSubnet2RouteTableAssociationF1D5617F replace
[~] AWS::EC2::Route vpc/privateSubnet2/DefaultRoute vpcprivateSubnet2DefaultRouteF7D5A1BD replace

I believe the issue here is that the VPC is being replaced (as is the NAT Gateway), but the VPCGatewayAttachment isn't being updated as well. This means that the new VPC ends up without the attached Internet Gateway, leading to the error. We likely need to add a dependency between the InternetGateway (+Attachment) and the NAT so they are replaced in the correct order.

Thanks for filing the bug! We welcome community contributions! If you are able, we encourage you to contribute. If you decide to contribute, please start an engineering discussion in this issue to ensure there is a commonly understood design before submitting code. This will minimize the number of review cycles and get your code merged faster.

omriman12 commented 2 years ago

same issue for me, any work-around?

corymhall commented 2 years ago

@tai-acall It looks like this type of update is not currently possible with CloudFormation. The order of operations that CloudFormation will process in this type of change would be:

  1. New VPC is created
  2. VPCGatewayAttachment is updated (detaches the IGW from the old VPC and attaches the IGW to the new VPC)
  3. NatGateway is created in the new VPC
  4. NatGateway is deleted in the old VPC
  5. Old VPC is deleted

The problem with this order is that an Internet Gateway can only be attached to one VPC at a time. So step 2 needs to first detach the Internet Gateway from the old VPC before it attaches it to the new one. In order to detach the IGW from the old VPC, that VPC cannot have any allocated Elastic IP addresses, like the one allocated to the NatGateway. So the actual order of operations that CloudFormation would need to be able to process would be:

  1. New VPC is created
  2. NatGateway is deleted from old VPC (releasing the EIP)
  3. VPCGatewayAttachment is updated (detaches the IGW from the old VPC and attaches the IGW to the new VPC)
  4. NatGateway is created in the new VPC
  5. Old VPC is deleted

Unless CloudFormation allows for more granular control over the order of operations, this will not be possible to fix. Since this is a very destructive action (destroying and recreating the VPC), the recommended workaround is to first delete the stack with the old CIDR and then create the stack fresh with the new CIDR.
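
To make the two orderings above easier to compare, here is a minimal sketch that models only the two constraints stated in this comment: an Internet Gateway attaches to one VPC at a time, and it cannot be detached while the VPC still has a NAT gateway holding an Elastic IP. It is a toy model, not CloudFormation or the AWS API; every type, function, and step name in it (`InternetGateway`, `runUpdate`, `moveIgw`, and so on) is hypothetical, and its error strings are illustrative rather than real service output.

```typescript
// Toy model of the constraints described above. All names are hypothetical;
// nothing here calls CloudFormation or the AWS SDK.
type Vpc = { name: string; hasNatWithEip: boolean };

class InternetGateway {
  attachedTo: Vpc | null = null;

  attach(vpc: Vpc): void {
    // Constraint 1: an IGW can be attached to only one VPC at a time.
    if (this.attachedTo !== null) {
      throw new Error(`IGW is already attached to ${this.attachedTo.name}`);
    }
    this.attachedTo = vpc;
  }

  detach(): void {
    // Constraint 2: detaching fails while the VPC still has a NAT gateway with an EIP.
    if (this.attachedTo !== null && this.attachedTo.hasNatWithEip) {
      throw new Error(`${this.attachedTo.name} still has an allocated Elastic IP`);
    }
    this.attachedTo = null;
  }
}

function createNat(vpc: Vpc, igw: InternetGateway): void {
  // A NAT gateway needs an IGW attached to its VPC.
  if (igw.attachedTo !== vpc) {
    throw new Error(`${vpc.name} has no Internet gateway attached`);
  }
  vpc.hasNatWithEip = true;
}

type Step = "moveIgw" | "createNewNat" | "deleteOldNat";

// Starts from "old VPC with IGW and NAT, new VPC already created" and applies the steps in order.
function runUpdate(steps: Step[]): string {
  const oldVpc: Vpc = { name: "old-vpc", hasNatWithEip: true };
  const newVpc: Vpc = { name: "new-vpc", hasNatWithEip: false };
  const igw = new InternetGateway();
  igw.attachedTo = oldVpc;
  try {
    for (const step of steps) {
      if (step === "moveIgw") {
        igw.detach();
        igw.attach(newVpc);
      } else if (step === "createNewNat") {
        createNat(newVpc, igw);
      } else {
        oldVpc.hasNatWithEip = false; // deleting the old NAT releases its EIP
      }
    }
    return "UPDATE_COMPLETE";
  } catch (e) {
    return `UPDATE_FAILED: ${(e as Error).message}`;
  }
}

// First list: the order CloudFormation would process.
const cloudFormationOrder = runUpdate(["moveIgw", "createNewNat", "deleteOldNat"]);
// Second list: the order that would be required.
const requiredOrder = runUpdate(["deleteOldNat", "moveIgw", "createNewNat"]);
console.log(cloudFormationOrder);
console.log(requiredOrder);
```

In this model the first ordering fails at the IGW move, while the second completes, which matches the reasoning above: the old NAT gateway has to go before the IGW can change VPCs.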

thiagobasilio-nanga commented 2 years ago

I have the same issue with CDK 2.10.0 (build e5b301f). And this will probably happen on any version of AWS CDK until CloudFormation allows for more granular control over the order of operations, as @corymhall described.

MariaRocco commented 2 years ago

are there any known workarounds for this error?

thiagobasilio-nanga commented 2 years ago

> are there any known workarounds for this error?

Probably not. When you see the label needs-cfn it means that this issue is waiting on changes to CloudFormation before it can be addressed.

vincent851 commented 2 years ago

I am facing the same issue here. Can this be prioritized for a fix? Our VPC encompasses and is integrated with multiple stacks, so it is not possible for us to delete this stack without first deleting all associated/dependent stacks.

caveman-dick commented 2 years ago

What I did for this as a workaround was to change the id of the VPC which will force create a whole new VPC and then delete the old one rather than trying to move things over. As the VPC itself and all of the subnets need to be replaced anyway there isn't much difference in reality.

If you have dependencies you will need to replace them to update the VPC so you will need to have a transitionary period where you have both old and new VPCs in place.
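
A rough sketch of why changing the id behaves differently from changing the CIDR in place: CloudFormation matches resources between the old and new template by logical ID, and CDK derives logical IDs from the construct id (the `'vpc'` argument in `new ec2.Vpc(this, 'vpc', ...)`). The code below is a toy diff, not the CDK or CloudFormation API. The names `plan` and `vpcResources` and the new id `vpcV2` are made up for illustration, and real logical IDs also carry a hash suffix that is omitted here.

```typescript
// Toy template diff keyed by logical ID. All names are hypothetical;
// this does not use aws-cdk-lib or call CloudFormation.
type Template = Record<string, { type: string; cidr?: string }>;
type Action = "create" | "replace" | "delete" | "keep";

function plan(before: Template, after: Template): Record<string, Action> {
  const actions: Record<string, Action> = {};
  for (const id of Object.keys(after)) {
    if (!(id in before)) {
      actions[id] = "create"; // unknown logical ID: brand-new resource
    } else if (before[id].cidr !== after[id].cidr) {
      actions[id] = "replace"; // same logical ID, changed CIDR: replacement
    } else {
      actions[id] = "keep"; // same logical ID, nothing changed
    }
  }
  for (const id of Object.keys(before)) {
    if (!(id in after)) {
      actions[id] = "delete"; // logical ID no longer in the template
    }
  }
  return actions;
}

// Simplified stand-in for what one VPC construct synthesizes:
// child logical IDs are prefixed with the construct id.
function vpcResources(constructId: string, cidr: string): Template {
  return {
    [constructId]: { type: "AWS::EC2::VPC", cidr },
    [`${constructId}IGW`]: { type: "AWS::EC2::InternetGateway" },
  };
}

// Same construct id, new CIDR: the VPC is replaced but the IGW is kept,
// so one IGW has to move between the old and the new VPC.
const inPlace = plan(vpcResources("vpc", "10.10.0.0/16"), vpcResources("vpc", "10.20.0.0/16"));

// New construct id: everything under it is created fresh (with its own IGW),
// and the old resources are deleted.
const renamed = plan(vpcResources("vpc", "10.10.0.0/16"), vpcResources("vpcV2", "10.20.0.0/16"));

console.log(inPlace);
console.log(renamed);
```

With the new id, the new VPC never has to share the old Internet Gateway, which is why the update no longer depends on the detach/attach ordering.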

vincent851 commented 2 years ago

> What I did for this as a workaround was to change the id of the VPC which will force create a whole new VPC and then delete the old one rather than trying to move things over. As the VPC itself and all of the subnets need to be replaced anyway there isn't much difference in reality.
>
> If you have dependencies you will need to replace them to update the VPC so you will need to have a transitionary period where you have both old and new VPCs in place.

Thank you for sharing a workaround, @caveman-dick. This is a good workaround... I am debating whether to do this because it will cause an outage for our production environment. Nonetheless, this is a good workaround. Thanks.

caveman-dick commented 2 years ago

> > What I did for this as a workaround was to change the id of the VPC which will force create a whole new VPC and then delete the old one rather than trying to move things over. As the VPC itself and all of the subnets need to be replaced anyway there isn't much difference in reality. If you have dependencies you will need to replace them to update the VPC so you will need to have a transitionary period where you have both old and new VPCs in place.
>
> Thank you for sharing a workaround, @caveman-dick. This is a good workaround... I am debating whether to do this because it will cause an outage for our production environment. Nonetheless, this is a good workaround. Thanks.

Yeah, we are having to do this for our environment at the moment, as we need to migrate instances off EC2-Classic. The VPCs we had in place have CIDRs that clash with the EC2-Classic CIDR, so I can't set up ClassicLink correctly. I'm going to run both VPCs side by side until all dependencies are migrated over and then drop the old one. Thankfully we don't have that much in the VPC at the moment and nothing that will cause an outage when we move over.

peterwoodworth commented 2 years ago

Since we need CloudFormation support for this - if anyone is interested in seeing this feature I would recommend opening an issue in the CloudFormation coverage roadmap if one doesn't exist yet.