aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.55k stars 3.87k forks source link

AwsCustomResource: Race condition in IAM policy updates #18237

Open lordjabez opened 2 years ago

lordjabez commented 2 years ago

What is the problem?

I have a construct that performs several AWS SDK calls using AwsCustomResource, each of which should have an IAM policy scoped as tightly as possible. Sometimes, later calls fail with a permission error, even though the IAM policy being applied is correct.

Per the documentation: "As this custom resource uses a singleton Lambda function, it's important to note the that function's role will eventually accumulate the permissions/grants from all resources."

What I discovered is that earlier custom resource calls succeed because it takes a little while (~60 seconds) for the Lambda to be created, allowing enough time for the associated IAM policy to propagate. However, subsequent custom resource calls reuse the Lambda, and if it executes too quickly after the policy is edited with the new permission, it won't have said permission and will fail.

Reproduction Steps

I can provide code on request but can't yet post publicly (working on getting permission to do so). But to reproduce should be straightforward:

  1. Create an AwsCustomResource call that relies on policy A. Execute that call.
  2. Create an AwsCustomResource call that relies on policy B. Execute that call.

If the second call happens fast enough after the policy is applied, it will fail.

What did you expect to happen?

All custom resource calls succeed.

What actually happened?

One of the custom resource calls failed with a permissions error.

CDK CLI Version

2.3.0 (build beaa5b2)

Framework Version

No response

Node.js Version

v16.7.0

OS

MacOS

Language

Typescript

Language Version

4.5.4

Other information

I have a CloudFormation event log I can share that illustrates the problem. Contact me directly and I can provide it.

lordjabez commented 2 years ago

I see four potential fixes (there may be others):

  1. Sleep in the custom resource runtime code long enough to be reasonably sure any associated IAM policies have propagated. Hacky, and has Lambda timeout implications, but quick and easy.

  2. Implement simple retry logic in the above code. Would be easy enough, with say a hard-coded 10s, 30s, 60s backoff. Or could be parameterized.

  3. Change AWS custom resource calls to be async, and implement the retry logic in the state machine poller. Seems a slightly more natural place for it, and easier to implement customization parameters. But would have implications for all other uses of AWS custom resource calls.

  4. Modify the CDK framework so that it doesn't reuse the same Lambda for all custom resource calls in a given stack (or at least has an option to not reuse). I tried specifying a custom functionName parameter across multiple calls hoping it would force different functions, but the first custom name was used and the others were ignored. Perhaps this could be the mechanism to use separate Lambdas? Less performant but would help with my use case.

rix0rrr commented 2 years ago

If all is well, there should be DependsOn relationship between the Custom Resource invocation and the Policy that adds the necessary permissions. It will look like this (taken from an integration test snapshot):

{
    "PublishCustomResourcePolicyDF696FCA": {
      "Type": "AWS::IAM::Policy",
      "Properties": {
        "PolicyDocument": {
          "Statement": [ { ... } ],
        },
        "Roles": [
          { "Ref": "AWS679f53fac002430cb0da5b7982bd2287ServiceRoleC1EA0FF2" }
        ]
      }
    },
    "Publish2E9BDF73": {
      "Type": "Custom::SNSPublisher",
      "Properties": {
        "ServiceToken": { ... },
        "Create": { ... },
        "Update": { ... },
      },
      "DependsOn": [
        "PublishCustomResourcePolicyDF696FCA"  // <------------ this
      ],
    }
}

This dependency will avoid the race condition.

Can you confirm that this DependsOn relationship exists in your template, and if not, it would be very helpful to share the code you are using.

lordjabez commented 2 years ago

Here's snippets from the code. First from Construct A:

    const networkDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNetwork',
      parameters: { NetworkId: this.networkId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const networkData = new customresources.AwsCustomResource(this, 'NetworkDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}::networks/${this.networkId}`],
      }),
      onCreate: networkDataSdkCall,
      onUpdate: networkDataSdkCall,
    });

And then from Construct B (the one that sometimes fails):

    const nodeDataSdkCall = {
      service: 'ManagedBlockchain',
      action: 'getNode',
      parameters: { NetworkId: networkId, MemberId: memberId, NodeId: this.nodeId },
      physicalResourceId: customresources.PhysicalResourceId.of('Id'),
    };

    const nodeData = new customresources.AwsCustomResource(this, 'NodeDataResource', {
      policy: customresources.AwsCustomResourcePolicy.fromSdkCalls({
        resources: [`arn:${partition}:managedblockchain:${region}:${account}:nodes/${this.nodeId}`],
      }),
      onCreate: nodeDataSdkCall,
      onUpdate: nodeDataSdkCall,
    });

The synthesized CloudFormation has the proper DependsOn between the update to the policy and the AwsCustomResource call in Construct B, so the policy is indeed created first. But because of IAM propagation delay, it sometimes isn't fully "ready" yet by the time the Lambda runs.

lordjabez commented 2 years ago

Here's a screenshot of the CloudFormation events that illustrate the issue:

Screen Shot 2022-01-01 at 2 49 40 PM

rix0rrr commented 2 years ago

Alright, thanks for the thorough report.

Seems like built-in retry logic would be the best solution. I'm just a little concerned that it might be hard to classify the right error code from all services, since I'm pretty sure there's no standard for it. We can probably start with AccessDenied and take it from there though.

ibliskavka commented 1 year ago

I am experiencing this issue with connect:AssociateBot and lex:UpdateBotLocale.

The custom resource succeeds when turn off rollback and run the deploy multiple times - so I am certain its a policy propagation issue.

In the meantime I broadened my permission policies but a retry on AccessDenied would be perfect.

caguado commented 4 months ago

I have found this race condition as a consequence of disabling installLatestAwsSdk from a separate issue.

As observed in the analysis above from @lordjabez, the singleton lambda keeps running in the same CFN run and IAM permissions don't appear to propagate on time. I posit that such effect is only apparent because Lambda still running means the assumed role still uses the previous inline policy version, so even if IAM propagates the policy changes fast enough, Lambda may not spin a new function that uses them at all.

I have had success by forcing a grant on the default identity policy of the underlying Lambda role because that policy is deployed early on and together with the SingletonFunction. So, besides the retry, I would suggest to explore how to coalesce policies or added to the default role policy so that those are deployed before the custom resource's Lambda is first invoked.

The grant snippet below, using SSM as example, works for me as a workaround for now:

const resource = existingResource ?? otherConstruct.node.defaultChild as AwsCustomResource;

const grant = Grant.addToPrincipal({
    grantee: resource,
    actions: ['ssm:GetParameter'],
    resourceArns: AwsCustomResourcePolicy.ANY_RESOURCE,
});
grant.assertSuccess();