aws-solutions / aws-control-tower-customizations

The Customizations for AWS Control Tower solution combines AWS Control Tower and other highly-available, trusted AWS services to help customers more quickly set up a secure, multi-account AWS environment using AWS best practices.
https://docs.aws.amazon.com/controltower/latest/userguide/cfct-overview.html
Apache License 2.0

Frequent ConcurrentModificationException on running SCP updates #175

Closed Cihl28 closed 4 months ago

Cihl28 commented 8 months ago

Describe the bug We're using Control Tower with CfCT for deploying various SCPs to Organizations. About 50% of the time, the pipeline fails. Example step output:

{
  "errorMessage": "An error occurred (ConcurrentModificationException) when calling the EnablePolicyType operation: AWS Organizations can't complete your request because it conflicts with another attempt to modify the same entity. Try again later.",
  "errorType": "ConcurrentModificationException",
  "stackTrace": [
    " File \"/var/task/state_machine_router.py\", line 218, in lambda_handler\n return service_control_policy(event, function_name)\n",
    " File \"/var/task/state_machine_router.py\", line 113, in service_control_policy\n response = scp.enable_policy_type()\n",
    " File \"/var/task/cfct/state_machine_handler.py\", line 1027, in enable_policy_type\n scp.enable_policy_type(root_id)\n",
    " File \"/var/task/cfct/aws/services/scp.py\", line 127, in enable_policy_type\n self.org_client.enable_policy_type(\n",
    " File \"/var/task/botocore/client.py\", line 391, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
    " File \"/var/task/botocore/client.py\", line 719, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
  ]
}

To Reproduce Run the pipeline. Fails frequently.

Expected behavior SCPs deployed without errors.

Additional context Case number with AWS support: 14155855571 Response from engineer:

Hello!

Thanks for providing the error from the step function log output!

I researched this issue internally and found that the Control Tower service team is aware of it. The team will prioritize a fix in an upcoming release. The fix requires a code change to the Python file "scp.py", whose "enable_policy_type" function (lines 125-139) makes the "EnablePolicyType" API call [1][2]. The proposed change will handle the "ConcurrentModificationException" error and retry the "EnablePolicyType" API call when it occurs.

Per my understanding, the CfCT pipeline is currently designed to call the "EnablePolicyType" API to enable the service control policy type for the organization before creating the policy. Per the documentation, the "EnablePolicyType" API call enables a policy type in a root [2]. It's a one-time operation: after you enable a policy type in a root, you can attach policies of that type to the root, any organizational unit (OU), or account in that root. This means you do not have to call this API every time you create a new SCP. Once the SCP policy type is enabled for the root of the organization, you can create SCPs and attach them to the root, OUs, or accounts.

As the documentation outlines, this is an "asynchronous" request that AWS performs in the background. AWS recommends first using "ListRoots" to see the status of policy types for a specified root, and then calling "EnablePolicyType" only if the desired policy type (e.g. Service Control Policy) is not enabled for that root.

In your case, since multiple policies are being created in parallel, the "EnablePolicyType" API is called every time, which leads to concurrent actions occurring at the same time. On some occasions, calling the "EnablePolicyType" API can also return the error "The specified policy type is already enabled." This is expected, because a policy type only needs to be enabled once. This exception is already handled in the Python file "scp.py" on lines 130-135.

However, the "EnablePolicyType" API is an asynchronous request, so it takes some time to process one request and return a success or error code. As a result, when another request is made for the same policy type at the same time, it can raise the "ConcurrentModificationException" you encountered, which means that one request is already in progress and you should try again later.

Overall, the internal team plans to have this issue resolved in their next release, which is targeted for the end of the year. In the meantime, they recommend retrying the pipeline stage when it fails. You can watch the AWS Control Tower GitHub Releases page for the version that includes the change [3]. Please feel free to raise this concern on the GitHub Issues page so you can publicly track the issue as well [4].

I hope the above provided some valuable information. I'm located in Seattle, WA and available Mon.-Fri. (9:00AM-6:00PM PST). If you have any additional questions or concerns, please feel free to contact us and we would be happy to help.

Thanks again!
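
The ListRoots-then-EnablePolicyType check described in the support response above could look roughly like the sketch below. It is a minimal illustration, not the CfCT code: the function name `ensure_policy_type_enabled` is hypothetical, and `org_client` is assumed to be a boto3 Organizations client (or any object exposing the same `list_roots` and `enable_policy_type` calls).

```python
SCP = "SERVICE_CONTROL_POLICY"


def ensure_policy_type_enabled(org_client, root_id, policy_type=SCP):
    """Call EnablePolicyType only if ListRoots shows the type is not enabled.

    Hypothetical helper. Returns True if EnablePolicyType was called,
    False if the policy type was already enabled or is being enabled.
    """
    # An organization has a single root, so one ListRoots page is assumed here.
    roots = org_client.list_roots()["Roots"]
    root = next(r for r in roots if r["Id"] == root_id)

    # Map each policy type on the root to its status, e.g. "ENABLED".
    statuses = {pt["Type"]: pt["Status"] for pt in root.get("PolicyTypes", [])}
    if statuses.get(policy_type) in ("ENABLED", "PENDING_ENABLE"):
        return False  # nothing to do, skip the API call entirely

    org_client.enable_policy_type(RootId=root_id, PolicyType=policy_type)
    return True
```

With boto3 this would be called as `ensure_policy_type_enabled(boto3.client("organizations"), root_id)`. Note that a check like this only reduces the number of EnablePolicyType calls; two parallel executions can still both see the type as disabled and race each other, so it does not replace retry handling.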

Cihl28 commented 8 months ago

Further thoughts for your consideration:

stumins commented 8 months ago

Hi @Cihl28,

Thanks for reporting this bug. We are aware of the issue and have scheduled a patch for the next release.

stumins commented 7 months ago

Hi @Cihl28,

CFCT v2.7.0 was just released with a patch for this issue. Please upgrade to v2.7.0 and let us know if you continue to experience this error.

Cihl28 commented 7 months ago

Just noticed this. I'll plan to have this update applied soon.

NobuHiramatsu commented 6 months ago

@stumins I had the same error as above. Thanks for the improvement in v2.7.0.

However, in v2.6.0, the same error also occurs occasionally, not in the EnablePolicyType operation but in the UpdatePolicy operation, as shown in the error output below.

{
  "errorMessage": "An error occurred (ConcurrentModificationException) when calling the UpdatePolicy operation: AWS Organizations can't complete your request because it conflicts with another attempt to modify the same entity. Try again later.",
  "errorType": "ConcurrentModificationException",
  "stackTrace": [
    "  File \"/var/task/state_machine_router.py\", line 218, in lambda_handler\n    return service_control_policy(event, function_name)\n",
    "  File \"/var/task/state_machine_router.py\", line 77, in service_control_policy\n    response = scp.update_policy()\n",
    "  File \"/var/task/cfct/state_machine_handler.py\", line 882, in update_policy\n    response = scp.update_policy(\n",
    "  File \"/var/task/cfct/aws/services/scp.py\", line 80, in update_policy\n    response = self.org_client.update_policy(\n",
    "  File \"/var/task/botocore/client.py\", line 391, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/task/botocore/client.py\", line 719, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
}

Since there appears to be no modification to the update_policy method in this update, I fear that the UpdatePolicy operation will continue to generate ConcurrentModificationException in v2.7.0.

v2.6.0 https://github.com/aws-solutions/aws-control-tower-customizations/blob/b1ba765b50480b12cef0f0e06d2a4c26fd53bfea/source/src/cfct/aws/services/scp.py#L78

v2.7.0 https://github.com/aws-solutions/aws-control-tower-customizations/blob/2fa6e6170230dc97410006897e389a3146b5be23/source/src/cfct/aws/services/scp.py#L82

I haven't been able to update yet, so I have not been able to confirm this with v2.7.0. Are you aware of this issue and, if so, do you plan to fix it?
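
For anyone who wants to guard the UpdatePolicy call themselves in the meantime, one possible approach is a small retry wrapper with exponential backoff, sketched below. This is an assumption about how such a guard could be written, not the change shipped in v2.7.0. The helper name `call_with_retry` and its parameters are hypothetical. It relies only on the fact that botocore's `ClientError` carries the service error code in `exc.response["Error"]["Code"]`.

```python
import random
import time

RETRYABLE_CODE = "ConcurrentModificationException"


def call_with_retry(api_call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run api_call(), retrying on ConcurrentModificationException.

    Hypothetical helper. Any other error, or the error on the final
    attempt, is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return api_call()
        except Exception as exc:
            # botocore ClientError stores the parsed error under .response.
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code != RETRYABLE_CODE or attempt == max_attempts:
                raise
            # Exponential backoff with full jitter: up to 1s, 2s, 4s, ...
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Usage would be along the lines of `call_with_retry(lambda: org_client.update_policy(PolicyId=policy_id, Content=content))`, where `org_client` is a boto3 Organizations client. The jitter is there so that parallel state machine executions that collide once are less likely to collide again on the retry.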

Cihl28 commented 4 months ago

I can now confirm that this issue is no longer present for me in v2.7.0. Thanks for fixing it!

Cihl28 commented 4 months ago

I'll just go ahead and close this issue.