aws-ia / terraform-aws-control_tower_account_factory

AWS Control Tower Account Factory
Apache License 2.0

Insufficient error handling when Control Tower fails to create an account #456

Open evan10s opened 1 month ago

evan10s commented 1 month ago

Terraform Version & Prov: Terraform 1.5.5, open-source

AFT Version: 1.12.1

Terraform Version & Provider Versions

Please provide the outputs of terraform version and terraform providers from within your AFT environment

I can provide these if needed, but leaving them out because it's non-trivial to run the AFT terraform locally, and I don't think this issue is related to Terraform.

Bug Description

High-level: Sometimes, an AFT request for an account is syntactically correct, but Control Tower fails to actually create the account. Today, we had an account request where this happened, but the only error we got from AFT was about a failed call to DescribeAccount, even though the true error was from Control Tower and visible in Service Catalog in our management account.

More specifically: The aft-invoke-aft-account-provisioning-framework Lambda does not check whether the CreateManagedAccount call failed in Control Tower. When it does fail, the event data sent to the Lambda lists the account ID as "Not Available". aft-invoke-aft-account-provisioning-framework then tries to call AWS Organizations' DescribeAccount method with the account ID "Not Available", which causes boto3 to throw an exception (An error occurred (InvalidInputException) when calling the DescribeAccount operation: You provided a string that exceeds that maximum length.).
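One plausible reading of the length error, offered as an assumption rather than something confirmed from the AFT source: Organizations expects a 12-digit account ID, and the placeholder string is 13 characters long. The sketch below shows a format check that could run before DescribeAccount is called. The helper name `is_valid_account_id` is hypothetical and is not part of AFT.

```python
import re

# AWS account IDs are 12-digit strings (assumption used for this check).
ACCOUNT_ID_PATTERN = re.compile(r"[0-9]{12}")


def is_valid_account_id(value) -> bool:
    """Return True only when value is a string of exactly 12 ASCII digits."""
    return isinstance(value, str) and ACCOUNT_ID_PATTERN.fullmatch(value) is not None


print(len("Not Available"))                  # 13
print(is_valid_account_id("Not Available"))  # False
print(is_valid_account_id("123456789012"))   # True
```

A check like this would let the Lambda fail early with its own message instead of surfacing the InvalidInputException from Organizations.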

This does cause a failure to propagate in AFT, which is good, but proactively catching and erroring with a more descriptive message than the DescribeAccount error would make this easier to debug. Even saying that the account creation request failed would have sent my debugging in a more productive direction.

To Reproduce

Steps to reproduce the behavior:

  1. Create an AFT account request where the account email is already used as the root account email for another AWS account.
  2. Commit the change so AFT's pipelines will apply the Terraform and AFT will try to create the account.
  3. The aft-failure-notifications SNS topic should receive a message with the DescribeAccount error mentioned above.

Expected behavior

If the account fails to create in Control Tower, AFT should error with that information rather than trying to continue with an invalid account ID.
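A minimal sketch of what such a guard might look like, assuming the Lambda receives an event shaped like the partially redacted JSON included later in this issue. The exception class `AccountCreationFailed` and the function `get_created_account_id` are hypothetical names for illustration, not part of the AFT codebase.

```python
class AccountCreationFailed(Exception):
    """Hypothetical error raised when Control Tower reports a failed creation."""


def get_created_account_id(event: dict) -> str:
    """Return the new account ID, or raise a descriptive error on failure."""
    status = event["detail"]["serviceEventDetails"]["createManagedAccountStatus"]
    if status.get("state") == "FAILED":
        account_name = status.get("account", {}).get("accountName", "unknown")
        raise AccountCreationFailed(
            f"Control Tower failed to create account '{account_name}': "
            f"{status.get('message', 'no message provided')} "
            "Check the provisioned product in Service Catalog for the root cause."
        )
    return status["account"]["accountId"]
```

With a guard like this in front of the DescribeAccount call, the message sent to the aft-failure-notifications SNS topic would name the failed account creation directly rather than the downstream InvalidInputException.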

Related Logs

I think I shared everything that's relevant elsewhere, but I can grab more logs if that would be useful.

Additional context

We had an AFT account request fail on initial creation due to this error, which I found in Service Catalog's provisioned products list:

AWS Control Tower cannot create an account using email user@company.com because an AWS account with that email already exists, but it is not part of your AWS Control Tower organization.

However, the error sent from the AFT failures SNS topic just had this less useful error:

AFT account request failed

An error occurred in the 'aft-invoke-aft-account-provisioning-framework' Lambda function. For more information, search AWS Request ID 'c3e55225-a997-47a6-b3d7-6be2e2eea65d' in CloudWatch log group '/aws/lambda/aft-invoke-aft-account-provisioning-framework' Error Message: An error occurred (InvalidInputException) when calling the DescribeAccount operation: You provided a string that exceeds that maximum length.

After looking at the source code for the aft-invoke-aft-account-provisioning-framework Lambda, I found that it calls the DescribeAccount operation on a Control Tower event, but I couldn't figure out what the actual problematic event content was.

Tracing back, I noted that the aft-controltower-event-logger EventBridge rule triggers the aft-invoke-aft-account-provisioning-framework Lambda. Looking at the EventBridge rule, I saw that it also triggers the aft-controltower-event-logger Lambda, which writes to the aft-controltower-events DynamoDB table, so I went to that table. Finally, I found the problematic event:

Partially redacted event JSON

```json
{
  "id": "99f09435-b193-e673-a50e-bf61ee0fa086",
  "time": "2024-05-09T19:45:19Z",
  "account": "redacted",
  "detail": {
    "awsRegion": "us-east-1",
    "eventCategory": "Management",
    "eventID": "b2e5ef71-f7cc-46f9-b0fe-597e36413ebb",
    "eventName": "CreateManagedAccount",
    "eventSource": "controltower.amazonaws.com",
    "eventTime": "2024-05-09T19:45:19Z",
    "eventType": "AwsServiceEvent",
    "eventVersion": "1.08",
    "managementEvent": true,
    "readOnly": false,
    "recipientAccountId": "redacted",
    "requestParameters": null,
    "responseElements": null,
    "serviceEventDetails": {
      "createManagedAccountStatus": {
        "account": {
          "accountId": "Not Available",
          "accountName": "redacted-sandbox"
        },
        "completedTimestamp": "2024-05-09T19:45:19+0000",
        "message": "AWS Control Tower failed to create an enrolled account.",
        "organizationalUnit": {
          "organizationalUnitId": "Not Available",
          "organizationalUnitName": "Sandbox"
        },
        "requestedTimestamp": "2024-05-09T19:44:54+0000",
        "state": "FAILED"
      }
    },
    "sourceIPAddress": "AWS Internal",
    "userAgent": "AWS Internal",
    "userIdentity": {
      "accountId": "redacted",
      "invokedBy": "AWS Internal"
    }
  },
  "detail-type": "AWS Service Event via CloudTrail",
  "region": "us-east-1",
  "resources": [],
  "source": "aws.controltower",
  "version": "0"
}
```
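For anyone debugging a similar failure, the manual search through stored events could be approximated with a small filter like the one below. This is only a sketch: it assumes the records from the aft-controltower-events table have already been loaded and deserialized into plain Python dicts with the same shape as the event above, and the function name `failed_account_creations` is hypothetical.

```python
def failed_account_creations(events):
    """Summarize CreateManagedAccount events whose state is FAILED."""
    failures = []
    for event in events:
        detail = event.get("detail") or {}
        if detail.get("eventName") != "CreateManagedAccount":
            continue
        # serviceEventDetails may be missing or null on unrelated events.
        service_details = detail.get("serviceEventDetails") or {}
        status = service_details.get("createManagedAccountStatus") or {}
        if status.get("state") == "FAILED":
            failures.append({
                "event_id": event.get("id"),
                "account_name": (status.get("account") or {}).get("accountName"),
                "message": status.get("message"),
                "completed": status.get("completedTimestamp"),
            })
    return failures
```

Fetching the records from DynamoDB is deliberately left out, since the table's key schema and item layout are not shown in this issue.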

snebhu3 commented 1 week ago

@evan10s thank you for reporting this. I will create an internal backlog item to evaluate this request further.