awslabs / landing-zone-accelerator-on-aws

Deploy a multi-account cloud foundation to support highly-regulated workloads and complex compliance requirements.
https://aws.amazon.com/solutions/implementations/landing-zone-accelerator-on-aws/
Apache License 2.0

Pipeline stuck / broken after console VPN change #133

Closed twdcnz closed 1 year ago

twdcnz commented 1 year ago

Current Status I have resolved this within my LZA deployment, but I'm not entirely sure why it's resolved. I'm logging it because I think it would benefit from some attention.

Because we couldn't get the bug solved quickly, even with the help of our AWS account manager, we had to deploy all required resources manually in the console. We've deleted the LZA pipeline because if someone checked code in it would break production. So as well as fixing the actual bug, there needs to be some way to get support from AWS fairly quickly.

Describe the bug A static AWS Site-to-Site VPN was deployed using LZA. A user made a manual change to the VPN, after which the pipeline wouldn't deploy any more: it would get stuck at the "Network_Associations" stage. The cause seems to be that a CloudFormation stack in the network-production account is in UPDATE_ROLLBACK_FAILED due to a bug with a custom resource.

AWSAccelerator-NetworkAssociationsStack-0123456789-us-east-1

Last week I used "Continue rollback" multiple times but the stack stayed in "UPDATE_ROLLBACK_FAILED" state. I tried "Continue rollback" again today and the stack transitioned to "UPDATE_ROLLBACK_COMPLETE" state. I suspect the reason may have something to do with the state of the VPN tunnels, since we had just got one of the sets of VPN tunnels up, but that's only a guess. As far as I'm aware, nothing else major changed in the accounts between last week and this week that would affect the ability to roll back.

The CloudFormation stack had two resources in UPDATE_FAILED state. Screenshot below and here's a copy and paste with confidential information removed.

Logical ID: Vpn1VpnTransitGatewayAttachment2359179F
Status Reason: Received response status [FAILED] from custom resource. Message returned: Error: Unable to find VPN attachment vpn-1
    at am (/var/task/index.js:32:16584)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async im (/var/task/index.js:32:15933)
    at async Runtime.handler (/var/task/entrypoint.js:1:912)
    (RequestId: 89176e91-d36f-481a-ad72-cd4d85e971f3)

Logical ID: Vpn2VpnTransitGatewayAttachmentB9F159FA
Status Reason: Received response status [FAILED] from custom resource. Message returned: Error: Attachment vpn-2 for tgw-0eadfc74a39ecfe71 not found
    at im (/var/task/index.js:32:16055)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async Runtime.handler (/var/task/entrypoint.js:1:912)
    (RequestId: 0bc0cbea-cf64-41a5-8739-5abdf8ea7132)

I also tried commenting out the VPNs in the LZA configuration, which didn't work either. The CloudFormation stack was still in "UPDATE_ROLLBACK_FAILED" state.

Somewhat related pipeline issue

I managed to break the pipeline in a couple of other ways as well. The one I remember right now is changing a VPC CIDR. Create a VPC and subnets using LZA, let it deploy, then change the CIDR range of the VPC and / or subnets. I don't have details of the error messages, but I ended up having to delete the VPC in the console, comment it out in the LZA configuration, then put it back and deploy fresh again. I hadn't even deployed any resources to the VPC. The LZA pipeline in general seems to have problems with modifying / deleting resources.

To Reproduce

I haven't tried to reproduce this as I don't have a spare LZA landing zone, and establishing / disestablishing one takes significant time and effort.

  1. Stand up LZA including two statically routed VPNs attached to a TGW.
  2. In the AWS console, "Modify VPN connection options". Change the CIDR to something random, then back. If this doesn't cause the pipeline to fail, change something else. I'm not 100% sure what was changed as I didn't do it.
  3. Check in a change to the LZA configuration around the VPN. For example, change the name of the VPN from 'vpn-1' to 'vpn-2'.

Expected behavior Pipeline should continue to work.


Screenshots

image

CloudFormation Stack AWSAccelerator-NetworkAssociationsStack-123456789012-us-east-1

image


awsclemj commented 1 year ago

Hello @timwilddatacom, and thank you for your interest in Landing Zone Accelerator!

Based on the symptoms you've described, it sounds as though some drift was introduced from manual actions taken on a VPN connection after it was deployed by LZA. We strongly advise against modifying resources outside of the LZA configuration files, as our underlying CDK codebase cannot track those changes. Since CDK is a superset of CloudFormation, this best practice applies. Note that with LZA you do not need to perform the updates suggested in the referenced doc -- our pipeline is managing that process for you in the background.

It sounds as though specifically in this case the VPN name may have been changed. A custom resource we use to look up VPN IDs after they've been created relies on the Name tag of the resource, so any manual changes to that name outside of the LZA configuration files would cause this error.
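To illustrate why a renamed VPN would produce a "not found" style error, here is a minimal Python sketch of a lookup keyed on the Name tag. It is not LZA's actual custom resource code (which is not shown in this thread); the function name `find_vpn_id_by_name` and the sample IDs are hypothetical, and the input is assumed to be shaped like the `VpnConnections` list returned by the EC2 DescribeVpnConnections API.

```python
from typing import Any, Dict, List, Optional


def find_vpn_id_by_name(vpn_connections: List[Dict[str, Any]], name: str) -> Optional[str]:
    """Return the ID of the first VPN connection whose Name tag equals `name`.

    Returns None when no connection carries that Name tag, which is the
    situation a manual rename in the console would create.
    """
    for vpn in vpn_connections:
        # Flatten the [{"Key": ..., "Value": ...}] tag list into a dict.
        tags = {t["Key"]: t["Value"] for t in vpn.get("Tags", [])}
        if tags.get("Name") == name:
            return vpn["VpnConnectionId"]
    return None


# Hypothetical sample data: the second VPN was renamed outside the configuration.
connections = [
    {"VpnConnectionId": "vpn-0aaa", "Tags": [{"Key": "Name", "Value": "vpn-1"}]},
    {"VpnConnectionId": "vpn-0bbb", "Tags": [{"Key": "Name", "Value": "renamed-in-console"}]},
]

print(find_vpn_id_by_name(connections, "vpn-1"))  # vpn-0aaa
print(find_vpn_id_by_name(connections, "vpn-2"))  # None
```

With a lookup like this, the configuration still asks for `vpn-2`, but no resource carries that Name tag any more, so the lookup comes back empty even though the VPN itself still exists.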

Regarding your comments on changing CIDRs of a VPC, this action should complete successfully so long as the VPC is completely empty, i.e. there are no interface or gateway endpoints also deployed to it. Since you don't have the specific error message I cannot comment on what may have caused this particular issue. What I can say is that our next Implementation Guide update and configuration reference updates will have expanded guidance on modifying resources after they have been deployed.

Regarding your comments on support for the solution, have you seen our FAQ on getting support? The LZA is fully supported by AWS Support Engineering, so long as you have a support plan with AWS.

I hope these comments were helpful! I will leave this issue open until our next release so we can confirm with you if the expanded guidance in our documentation helps with your use cases. Please feel free to follow-up here in the meantime!

twdcnz commented 1 year ago

Thanks Clem 👍

Ah that's the support category to use. I looked in the list of supported services and didn't see LZA. In any case, my customer had support on the workload accounts but not the root / network accounts, so I couldn't have accessed support without their permission to increase support coverage.

It might be that the VPN name was changed, I don't recall exactly what the user did. VPN names were changed occasionally to trigger complete VPN replacement. I should have logged the bug while everything was fresh in my mind, but instead I emailed our AWS account manager, who couldn't find anyone in AWS who could help. I might try to reproduce if I get some time next week. I explicitly remember changing the names to match the original deployment and running the pipeline, and it still wouldn't complete the pipeline / let the stack roll back.

Could there be some kind of optimization to the custom code that LZA uses for VPNs to cater for this situation better?

It would be useful if there was some kind of guidance on how to resolve this kind of error. When the pipeline won't run, and one of the CF stacks is stuck in "UPDATE_ROLLBACK_FAILED", all we could do was delete the LZA pipeline, abandon LZA, and deploy using the console, because the customer couldn't wait for us to work this out. For example: could we have just deleted the CF stack that was in "rollback failed" state, and would the pipeline have recreated it? I did try to delete the stack, but my admin role didn't have permissions, and I decided if it was that well protected I shouldn't log in as root to do it.

awsclemj commented 1 year ago

In such a situation my advice would be to use the "Continue update rollback" feature and use the advanced options dropdown to ignore the resource(s) that are causing the rollback to fail. In your case, the VPN connections would then be orphaned from the stack and you can manage them outside of the LZA. More info on how to do that in this doc.

You can delete the stack as well, however that may negatively impact your network resources that require associations managed by that stack. "Continue update rollback" is a much safer option, especially for networks that are in production.
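The same "Continue update rollback" with skipped resources can also be requested through the API. The Python sketch below is one possible way to do it with boto3, under stated assumptions: the helper names `build_continue_rollback_request` and `continue_rollback` are hypothetical, and the sample data reuses the stack name and logical IDs quoted earlier in this issue. The boto3 calls used (`list_stack_resources` via a paginator and `continue_update_rollback` with `ResourcesToSkip`) are standard CloudFormation client operations. Note that skipped resources are marked as complete by CloudFormation without actually being rolled back, so they may no longer match the stack template.

```python
from typing import Any, Dict, List


def build_continue_rollback_request(
    stack_name: str, resources: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Build kwargs for ContinueUpdateRollback that skip every UPDATE_FAILED resource."""
    to_skip = [
        r["LogicalResourceId"]
        for r in resources
        if r.get("ResourceStatus") == "UPDATE_FAILED"
    ]
    return {"StackName": stack_name, "ResourcesToSkip": to_skip}


def continue_rollback(stack_name: str) -> None:
    """Look up failed resources and continue the rollback (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the pure helper above works without boto3 installed

    cfn = boto3.client("cloudformation")
    resources: List[Dict[str, Any]] = []
    for page in cfn.get_paginator("list_stack_resources").paginate(StackName=stack_name):
        resources.extend(page["StackResourceSummaries"])
    cfn.continue_update_rollback(**build_continue_rollback_request(stack_name, resources))


# Offline example using the resource names reported in this issue.
sample_resources = [
    {"LogicalResourceId": "Vpn1VpnTransitGatewayAttachment2359179F", "ResourceStatus": "UPDATE_FAILED"},
    {"LogicalResourceId": "Vpn2VpnTransitGatewayAttachmentB9F159FA", "ResourceStatus": "UPDATE_FAILED"},
    {"LogicalResourceId": "SomeOtherResource", "ResourceStatus": "UPDATE_COMPLETE"},
]
request = build_continue_rollback_request(
    "AWSAccelerator-NetworkAssociationsStack-0123456789-us-east-1", sample_resources
)
print(request["ResourcesToSkip"])
```

Only the offline part runs as written; `continue_rollback` would have to be called from a role that is allowed to modify the LZA-managed stack in the affected account.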

I hope this helps!

twdcnz commented 1 year ago

@awsclemj I did use "continue update rollback", multiple times, but it failed. I described that above. If that had worked it would have been fine.

It was only when I tried "continue update rollback" again a week later that the rollback worked. I don't know why it worked a week later. The VPN name hadn't changed in that week as far as I know, but the VPN state had, so my guess was it had something to do with the VPN state.

If I have some time next week I'll try to reproduce.

awsclemj commented 1 year ago

Understood, thanks! I was unsure based on your description if you had explicitly skipped the resources in question. Please keep us updated if you're able to reproduce! :)

twdcnz commented 1 year ago

Do you mean the skip resources in CloudFormation? There was no option to skip resources during the "continue rollback". I know CF sometimes gives you that kind of option when you're deleting stacks, but I haven't seen it during rollback.

Is there an easy way to reduce the LZA costs for test scenarios? I will probably disable logging, config, network firewall. There's AWS partner LZA training next week anyway.

The LZA documentation could use a bit of work, which tends to happen over time. Trying to work out the correct syntax to stand up a statically routed VPN took quite a while, particularly since it takes 60 - 90 minutes to run the pipeline and let it get down to the network section. A faster pipeline would be a HUGE win. The format for the shared secret in Secrets Manager is really odd, and not well described. The format for the VPNs themselves I found in an example someone posted somewhere.

awsclemj commented 1 year ago

Yes, there is an option to skip failed resources in an "Advanced troubleshooting" drop down box that appears when you choose to continue update rollback from the console. I found this support article which may help if the scenario occurs for you again. It's the second scenario listed.

As far as reducing costs, you can simply reduce the number of resources that are deployed via configuration. The mandatory components such as centralized logging will still be deployed, but you can certainly scale down the number of user-defined resources to use for test scenarios.

Regarding the documentation, we are actively working on overhauling the configuration reference documentation with additional context, guidance, and clarity. We sincerely appreciate your feedback!

twdcnz commented 1 year ago

Thanks Jimmy :) I don't think I have ever found the CloudFormation advanced troubleshooting dialog, that's helpful!

Is there any way to skip pipeline stages while still pulling the latest config code from the repo? It can take between one and two hours to run the pipeline, and the network stages are near the end, so it can take a really long time to test anything with LZA. Some kind of smart pipeline that only runs the required pipeline stages would be a HUGE benefit.

awsclemj commented 1 year ago

There is not a way to do this from a console, however we do expose the core CLI we use to invoke the pipeline stages. This CLI can be leveraged locally to deploy targeted changes to specific accounts and regions. Note that this does require a local installation of our development dependencies. We have more details on this in DEVELOPING.md. I hope this helps!

tomwaldnz commented 1 year ago

Thanks Jimmy. That seems a bit complex for general use. It would be good to see the pipeline optimised. For example, when I change the name of a VPN in the LZA config the only useful pipeline stages are the ones that pull the source and change the network connections, total time about 15 minutes max, but every other pipeline stage runs which takes about an hour.

I did the partner LZA training last week and kept my environment. I've deployed a VPN, made all kinds of changes in the console including VPN name, options, all kinds of things, and the pipeline manages to update the VPN again no problem. I haven't attached the VPN to a TGW, that's where the problem was that I logged initially. I'll see if I can reproduce that tomorrow if I get time.

twdcnz commented 1 year ago

I haven't been able to reproduce the issue. Someone must have changed something odd in the console, causing the problem. It would be good to have an enhancement so that the pipeline didn't get stuck so easily. Maybe someone will reproduce it one day, and then it can be fixed more easily.