awslabs / landing-zone-accelerator-on-aws

Deploy a multi-account cloud foundation to support highly-regulated workloads and complex compliance requirements.
https://aws.amazon.com/solutions/implementations/landing-zone-accelerator-on-aws/
Apache License 2.0
Prepare Stage Timeout via Control Tower actions in progress #576

Open richardkeit opened 1 month ago

richardkeit commented 1 month ago

Describe the bug: The Prepare stage CodeBuild project times out before all actions are completed.

To Reproduce: Control Tower is enabled via the LZA global-config.yaml:

controlTower:
  enable: true
  landingZone:
    version: '3.3'
    logging:
      loggingBucketRetentionDays: 365
      accessLoggingBucketRetentionDays: 365
      organizationTrail: true
    security:
      enableIdentityCenterAccess: true

Apply changes that require registering multiple OUs and that potentially involve landing zone drift.

Expected behavior: The Prepare stage can run for 8 hours or more (given that there is failure logic in the CodeBuild project).

Screenshots

Timeout:

Screenshot 2024-09-24 at 9 09 00 AM

Job processing:

Screenshot 2024-09-24 at 9 09 07 AM

OUs left unenrolled:

Screenshot 2024-09-24 at 9 08 41 AM

Additional context: I am not sure how the CodeBuild project timeout can be configured as 8 hours and then reduced when the action is added to the stage (I don't see it configured that way).

https://github.com/awslabs/landing-zone-accelerator-on-aws/blob/485c1ad6e8db98d994efadfd530b49c8d34ceed8/source/packages/@aws-accelerator/accelerator/lib/pipeline.ts#L456-L460

https://github.com/awslabs/landing-zone-accelerator-on-aws/blob/485c1ad6e8db98d994efadfd530b49c8d34ceed8/source/packages/@aws-accelerator/accelerator/lib/pipeline.ts#L579-L588

padebnat commented 1 month ago

Thank you for contacting the LZA team. Can you please confirm the number of Organizational Units (OUs) in this environment that need to be registered? This information will help us test and optimize the CodeBuild run time on our end. Retrying the Prepare stage should re-register the remaining OUs. Alternatively, if the ACCELERATOR_NO_ORG_MODULE variable is set to Yes for the Prepare CodeBuild project, the OU registration will be performed using the Prepare stack.

richardkeit commented 1 month ago

Hello @padebnat, in the initial run there was a landing zone reset (~20 minutes) and 14 OUs to register (1 ignored OU).

Rerunning the pipeline does not succeed.

Screenshot 2024-09-25 at 11 33 35 AM

First timed out at: 2024-09-23 23:18:33.390

(Rerun of stage action) Second job:

2024-09-24 01:09:42.391 | info | index | The organization unit "Workloads/Prod" already exists in AWS Organizations, create organizational operation skipped.
2024-09-24 01:09:42.604 | info | index | The organizational unit "Workloads/Prod" baseline status is "SUCCEEDED", update baseline skipped. 
...
AWSAccelerator-PrepareStack-XXXXXXXXXX-ap-southeast-2 | 1:13:44 AM | UPDATE_FAILED        | Custom::ValidateEnvironmentConfiguration           | ValidateEnvironmentConfig/ValidateEnvironmentResource/Default (ValidateEnvironmentConfigValidateEnvironmentResourceD10DC179) Received response status [FAILED] from custom resource. Message returned: Organizational Unit "Workloads/Non-Prod" not found.,Organizational Unit "Workloads/Prod" not found.,Organizational Unit Workloads/Non-Prod does not exist in AWS. Either remove from configuration or add OU via console.,Organizational Unit Workloads/Prod does not exist in AWS. Either remove from configuration or add OU via console.

From the Control Tower Console, I clicked Re-register OU:

        {
            "eventVersion": "1.08",
            "userIdentity": {
                "accountId": "XXXXXXXXXX",
                "invokedBy": "AWS Internal"
            },
            "eventTime": "2024-09-24T01:32:02Z",
            "eventSource": "controltower.amazonaws.com",
            "eventName": "RegisterOrganizationalUnit",
            "awsRegion": "ap-southeast-2",
            "sourceIPAddress": "AWS Internal",
            "userAgent": "AWS Internal",
            "requestParameters": null,
            "responseElements": null,
            "eventID": "19470ec6-0618-4a89-95be-e597281dd448",
            "readOnly": false,
            "eventType": "AwsServiceEvent",
            "managementEvent": true,
            "recipientAccountId": "XXXXXXXXXX",
            "serviceEventDetails": {
                "registerOrganizationalUnitStatus": {
                    "organizationalUnit": {
                        "organizationalUnitName": "Prod",
                        "organizationalUnitId": "ou-yyyyy"
                    },
                    "state": "SUCCEEDED",
                    "message": "AWS Control Tower successfully registered an organizational unit.",
                    "requestedTimestamp": "2024-09-24T01:30:55+0000",
                    "completedTimestamp": "2024-09-24T01:32:02+0000"
                }
            },
            "eventCategory": "Management"
        }

(Rerun of stage action) Third job succeeded:

2024-09-24 01:41:16.584 | info | index | The organization unit "Workloads/Prod" already exists in AWS Organizations, create organizational operation skipped.
2024-09-24 01:41:16.795 | info | index | The organizational unit "Workloads/Prod" baseline status is "SUCCEEDED", update baseline skipped.
2024-09-24 01:41:16.795 | info | index | No accounts found for organizational unit "Prod" to be invited to the AWS Organizations.

Looking into it, I think there may be a bug in this loop: https://github.com/awslabs/landing-zone-accelerator-on-aws/blob/36d87e6535e2a9c208980a9a4c726dac8ec18d2f/source/packages/@aws-accelerator/modules/lib/aws-organization/index.ts#L202-L206

Given the timestamps, I believe that baselineStatus is incorrectly reported as SUCCEEDED. I'm not a TypeScript dev, so happy to be told otherwise.
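
The timestamp comparison can be made explicit with a small sketch. It only does arithmetic on values quoted in this thread: the second job's "baseline status is SUCCEEDED" log line and the `requestedTimestamp` / `completedTimestamp` of the manual re-registration event. The helper `toEpochMs` is a hypothetical name, not part of LZA, and the sketch assumes the build log lines are in UTC like the CloudTrail event, which the logs themselves do not state.

```typescript
// Timestamps copied from the thread. Assumption: the build log is in UTC,
// like the CloudTrail event; the log itself does not state a time zone.
const statusLoggedAt = '2024-09-24 01:09:42.604'; // second job: "baseline status is SUCCEEDED"
const requestedAt = '2024-09-24T01:30:55+0000'; // CloudTrail requestedTimestamp
const completedAt = '2024-09-24T01:32:02+0000'; // CloudTrail completedTimestamp

/** Converts either timestamp format above to epoch milliseconds (UTC assumed). */
function toEpochMs(timestamp: string): number {
  const iso = timestamp.includes('T')
    ? timestamp.replace(/\+0000$/, 'Z') // CloudTrail style: swap the +0000 offset for Z
    : `${timestamp.replace(' ', 'T')}Z`; // build log style: add the T separator and Z
  const ms = Date.parse(iso);
  if (Number.isNaN(ms)) {
    throw new Error(`Unrecognized timestamp: ${timestamp}`);
  }
  return ms;
}

// How long the manual re-registration of the "Prod" OU took.
const registrationSeconds = (toEpochMs(completedAt) - toEpochMs(requestedAt)) / 1000;

// How long before the manual re-registration request the SUCCEEDED status was logged.
const leadSeconds = (toEpochMs(requestedAt) - toEpochMs(statusLoggedAt)) / 1000;

console.log(`Manual re-registration took ${registrationSeconds} s`);
console.log(`SUCCEEDED was logged ${leadSeconds} s before the re-registration was requested`);
```

Under that UTC assumption, the manual re-registration took 67 seconds, and the second job logged SUCCEEDED roughly 21 minutes before that re-registration was even requested. This is consistent with the status having been stale at that point, but it does not prove where in the loop the value came from.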