awslabs / landing-zone-accelerator-on-aws

Deploy a multi-account cloud foundation to support highly-regulated workloads and complex compliance requirements.
https://aws.amazon.com/solutions/implementations/landing-zone-accelerator-on-aws/
Apache License 2.0
532 stars, 424 forks

AWSAccelerator-NetworkAssociationsStack - Failed #276

Closed: david-midlink closed this issue 11 months ago

david-midlink commented 11 months ago

Describe the bug: While executing the pipeline, adding a tgw-attachment to the transit gateway route table during propagation (Network_Associations) fails. The failure is caused by a resource that no longer exists, which leaves the operation stuck.

Everything worked until I created an account named "CorpIT". The pipeline ran cleanly up to the propagation stage. When it failed, I attempted to remove it, but CloudFormation became unresponsive. Despite deleting everything, the system still seems to recognize it for no apparent reason.

AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 | 0/5 | 1:33:07 PM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack                    | AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 The following resource(s) failed to create: [CorpItMainNetworkMainCorePropagationXXXXXXXX, EndpointsVpcMainTgwCoreRtPropagationXXXXXXXX, NetworkEndpointsNetworkMainSpokePropagationXXXXXXXX, NetworkEndpointsNetworkMainCorePropagationXXXXXXXX, EndpointsVpcMainTgwSpokeRtPropagationXXXXXXXX]. The following resource(s) failed to update: [AssociateHostedZonesF0E2F0DA].

Failed resources:
AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 | 1:33:06 PM | CREATE_FAILED        | AWS::EC2::TransitGatewayRouteTablePropagation | EndpointsVpcMainTgwSpokeRtPropagation (EndpointsVpcMainTgwSpokeRtPropagationXXXXXXXX) Internal Failure
    new TransitGatewayRouteTablePropagation (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/constructs/lib/aws-ec2/transit-gateway.ts:69:5)
    \_ NetworkAssociationsStack.createVpcTransitGatewayPropagations (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:986:15)
    \_ NetworkAssociationsStack.createTransitGatewayResources (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:752:12)
    \_ new NetworkAssociationsStack (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:139:12)
    \_ main (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/bin/app.ts:980:44)
    \_ processTicksAndRejections (node:internal/process/task_queues:96:5)
    \_ async /codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/bin/app.ts:1017:5
AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 | 1:33:06 PM | CREATE_FAILED        | AWS::EC2::TransitGatewayRouteTablePropagation | EndpointsVpcMainTgwCoreRtPropagation (EndpointsVpcMainTgwCoreRtPropagationXXXXXXXX) Internal Failure
    new TransitGatewayRouteTablePropagation (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/constructs/lib/aws-ec2/transit-gateway.ts:69:5)
    \_ NetworkAssociationsStack.createVpcTransitGatewayPropagations (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:986:15)
    \_ NetworkAssociationsStack.createTransitGatewayResources (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:752:12)
    \_ new NetworkAssociationsStack (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/stacks/network-stacks/network-associations-stack/network-associations-stack.ts:139:12)
    \_ main (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/bin/app.ts:980:44)
    \_ processTicksAndRejections (node:internal/process/task_queues:96:5)
    \_ async /codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/bin/app.ts:1017:5

❌  AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 failed: Error: The stack named AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 failed to deploy: UPDATE_ROLLBACK_COMPLETE (Update successful. One or more resources could not be deleted.): Internal Failure, Internal Failure
    at FullCloudFormationDeployment.monitorDeployment (/codebuild/output/src4126/src/s3/00/source/node_modules/aws-cdk/lib/api/deploy-stack.ts:512:13)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async deployStack (/codebuild/output/src4126/src/s3/00/source/node_modules/aws-cdk/lib/cdk-toolkit.ts:265:24)
    at async /codebuild/output/src4126/src/s3/00/source/node_modules/aws-cdk/lib/deploy.ts:39:11
    at async run (/codebuild/output/src4126/src/s3/00/source/node_modules/p-queue/dist/index.js:163:29)

❌ Deployment failed: Error: Stack Deployments Failed: Error: The stack named AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 failed to deploy: UPDATE_ROLLBACK_COMPLETE (Update successful. One or more resources could not be deleted.): Internal Failure, Internal Failure
    at deployStacks (/codebuild/output/src4126/src/s3/00/source/node_modules/aws-cdk/lib/deploy.ts:61:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async CdkToolkit.deploy (/codebuild/output/src4126/src/s3/00/source/node_modules/aws-cdk/lib/cdk-toolkit.ts:339:7)
    at async Function.execute (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/toolkit.ts:312:9)
    at async Promise.all (index 4)
    at async Function.run (/codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/lib/accelerator.ts:601:5)
    at async /codebuild/output/src4126/src/s3/00/source/packages/@aws-accelerator/accelerator/cdk.ts:100:3
2023-10-06 13:39:32.866 | error | toolkit | Stack Deployments Failed: Error: The stack named AWSAccelerator-NetworkAssociationsStack-111111111111-eu-central-1 failed to deploy: UPDATE_ROLLBACK_COMPLETE (Update successful. One or more resources could not be deleted.): Internal Failure, Internal Failure
Deployment failed
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

[Container] 2023/10/06 13:39:32 Command did not exit successfully yarn run ts-node --transpile-only cdk.ts --require-approval never $CDK_OPTIONS --config-dir $CODEBUILD_SRC_DIR_Config --partition aws --app cdk.out exit status 1
[Container] 2023/10/06 13:39:32 Phase complete: BUILD State: FAILED
[Container] 2023/10/06 13:39:32 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: yarn run ts-node --transpile-only cdk.ts --require-approval never $CDK_OPTIONS --config-dir $CODEBUILD_SRC_DIR_Config --partition aws --app cdk.out. Reason: exit status 1
[Container] 2023/10/06 13:39:32 Entering phase POST_BUILD
[Container] 2023/10/06 13:39:32 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2023/10/06 13:39:32 Phase context status code:  Message:
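
When triaging a long deploy log like the one above, it can help to extract just the resource events that ended in a `*_FAILED` status. The following is a minimal sketch, assuming the pipe-delimited event format seen in this log; `parseFailedResources` and the `FailedResource` shape are hypothetical names introduced for illustration and are not part of LZA or the CDK.

```typescript
// Hypothetical helper: pull failed resource events out of CDK deploy output.
interface FailedResource {
  status: string;       // e.g. CREATE_FAILED
  resourceType: string; // e.g. AWS::EC2::TransitGatewayRouteTablePropagation
  logicalId: string;    // construct ID printed before the parenthesised ID
  reason: string;       // trailing text, e.g. "Internal Failure"
}

function parseFailedResources(log: string): FailedResource[] {
  const results: FailedResource[] = [];
  for (const line of log.split(/\r?\n/)) {
    // Split on pipes, tolerating escaped "\|" separators and any
    // leading log-viewer line number column.
    const fields = line.split(/\\?\|/).map((f) => f.trim());
    const i = fields.findIndex((f) => /^[A-Z_]+_FAILED$/.test(f));
    if (i < 0 || i + 2 >= fields.length) continue;
    // The field after the resource type looks like: "LogicalId (Id) reason".
    const m = /^(\S+)\s+\((\S+)\)\s*(.*)$/.exec(fields[i + 2]);
    if (!m) continue;
    results.push({
      status: fields[i],
      resourceType: fields[i + 1],
      logicalId: m[1],
      reason: m[3],
    });
  }
  return results;
}
```

Lines without a `*_FAILED` status field, such as the stack-level summary line or the stack frames, are skipped, so only the individual resource failures are returned.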

To Reproduce: Even after deleting all components associated with the network configuration and running the pipeline again, the issue persists.

Expected behavior: Propagation should work as it did for all other accounts.

Please complete the following information about the solution:

awsclemj commented 11 months ago

Hello @david-midlink, and thank you for opening an issue with the Landing Zone Accelerator team!

I would like to ensure I fully understand the nature of the error you're receiving. Based on your issue description and error messages, it sounds as though a new Transit Gateway route table propagation is being created for an attachment or route table that no longer exists in the environment -- is that accurate?

If so, the underlying configuration that creates that resource needs to be removed from your configuration files (not just from your environment). Removing resources outside of the LZA configuration files causes your environment to drift from the CloudFormation template, which can lead to errors like this one. More details can be found in "Configuration file best practices" in our documentation.

In order to resolve the issue, remove the configurations for the propagations from your network-config.yaml configuration file. The definition for the propagations, based on the error messages, should be under your VPC named EndpointsVpc. The propagations would be defined under the routeTablePropagations property in the TransitGatewayAttachmentConfig for the VPC in question.
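
To make the edit described above concrete, here is a minimal sketch of the change modelled on an already-parsed network-config.yaml. The `NetworkConfig`, `Vpc`, and `TgwAttachment` shapes are simplified, hypothetical stand-ins (the real LZA config types carry many more fields), and `dropPropagations` is an illustrative helper rather than anything LZA ships. Whether the key should be omitted or left as an empty list is an assumption to check against the configuration reference.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface TgwAttachment {
  name: string;
  routeTableAssociations?: string[];
  routeTablePropagations?: string[];
}

interface Vpc {
  name: string;
  transitGatewayAttachments?: TgwAttachment[];
}

interface NetworkConfig {
  vpcs: Vpc[];
}

// Return a copy of the config in which every Transit Gateway attachment
// of the named VPC has its routeTablePropagations key removed.
// The input object is not mutated; other VPCs are passed through as-is.
function dropPropagations(config: NetworkConfig, vpcName: string): NetworkConfig {
  return {
    ...config,
    vpcs: config.vpcs.map((vpc) => {
      if (vpc.name !== vpcName) return vpc;
      return {
        ...vpc,
        transitGatewayAttachments: vpc.transitGatewayAttachments?.map((att) => {
          const copy = { ...att };
          delete copy.routeTablePropagations;
          return copy;
        }),
      };
    }),
  };
}
```

Calling `dropPropagations(config, "EndpointsVpc")` and writing the result back to the config repository is the equivalent of deleting the `routeTablePropagations` entries by hand; the change only takes effect once the pipeline sources the updated file.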

I hope this information is helpful, and please feel free to respond back if this doesn't solve your issue. Thanks!

david-midlink commented 11 months ago

Hello @awsclemj,

Indeed, the information is correct, but regrettably, it can't be removed. I initially reconfigured and cleared the entire network from the YAML file. When the pipeline got stuck, I tried to eliminate resources the pipeline couldn't remove on its own.

Unfortunately, all these efforts were unsuccessful. The resources no longer exist, and CloudFormation is unable to reconcile that. I also attempted to delete the stack related to associations in the network account, hoping that redeployment would resolve the issue.

However, the stack remains stuck due to an "Internal Failure" error, and I'm unable to address it. I've raised a case with AWS, which was escalated to their internal CloudFormation team, but there's been no resolution so far.

I'm open to any further suggestions. Additionally, I'm curious if this might be tied to LZA or simply a CloudFormation issue.

Could the discrepancy be due to a case sensitivity issue? The name of the account is "CorpIT," but the logs display it as "CorpIt" with a lowercase 't'. I couldn't find any reference to this in the LZA code.

awsclemj commented 11 months ago

Hi @david-midlink,

Thank you for the added context. I am unable to comment on what could be causing the Internal Failure since I am unable to see the logs for the underlying service. I think that opening the support case is the correct path since AWS Support can work directly with the service team on the issue.

I do not believe the lowercase 't' is the issue; we use a PascalCase parser in our solution to generate the resource names, so that is likely why you see the deviation in the logs.
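
To illustrate why a PascalCase conversion turns "CorpIT" into "CorpIt", here is a minimal sketch. It is a hypothetical re-implementation written for this example, not the parser LZA actually uses, and the word-splitting rule is an assumption that happens to reproduce the name seen in the logs.

```typescript
// Hypothetical re-implementation for illustration; not the library LZA uses.
function toPascalCase(input: string): string {
  // A "word" is one of:
  //   - a run of capitals not followed by a lowercase letter ("IT"),
  //   - an optional capital followed by lowercase letters ("Corp"),
  //   - a run of digits.
  // Separators such as "-", "_" and spaces are dropped.
  const words = input.match(/[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+/g) ?? [];
  // Capitalise the first letter of each word and lowercase the rest,
  // which is what collapses the acronym "IT" to "It".
  return words
    .map((w) => w.charAt(0).toUpperCase() + w.slice(1).toLowerCase())
    .join("");
}

console.log(toPascalCase("CorpIT"));           // CorpIt
console.log(toPascalCase("CorpIT-Main-Core")); // CorpItMainCore
```

Because the conversion only affects the generated resource name, the account name "CorpIT" itself is unchanged.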

I'm curious whether you have done a full pipeline run (i.e. manually releasing a change) since removing the resources from the YAML file. LZA shouldn't try to create the resources if they are no longer in your configuration. Simply retrying the stage would use the older configuration file, since the new config would not have been picked up by the initial Source stage, so that could be the source of the issue.

david-midlink commented 11 months ago

Hello @awsclemj

Unfortunately, nothing proved effective, whether running it with or without the configuration (meeting the minimum requirements for validation).

As of now, AWS has not responded to my case regarding the issue. However, for some reason, after not using it for approximately three days, I attempted to delete the stack, and it succeeded. Following that, I redeployed my entire network configuration, and it worked. So, in reality, I have no idea what happened.

awsclemj commented 11 months ago

Hello @david-midlink, and thanks for following up!

I am glad to hear you are now unblocked in your environment. I will go ahead and close out this issue, but please don't hesitate to open another issue with us or AWS Support should you run into pipeline execution issues going forward.

Thank you for your interest in and support of the LZA!

snemir2 commented 7 months ago

FYI, it looks like I am hitting a very similar issue. Does deleting the stack and retrying help?