aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(AwsSdkCall): AccessDenied when installLatestAwsSdk is false due to credential caching? #23340

Open MikeDombo opened 1 year ago

MikeDombo commented 1 year ago

Describe the bug

I'm using the AwsSdkCall custom resource to operate on S3. We had timeout problems in China because installing the latest SDK took too long, so I disabled it with installLatestAwsSdk: false. Now, though, I often see AccessDenied failures. I believe this is because AwsSdkCall keeps creating or changing the IAM role and IAM is perhaps caching a bit, since the deployment works on retry.

I tried assigning a role that already has the necessary permissions, rather than letting AwsSdkCall just create a policy, but this has not fixed the problem.

It seems that installing the latest AWS SDK took enough time to ensure that all the credentials were properly configured, which is why I didn't have this problem until I removed that to speed things up and avoid timeouts in slower regions.

Expected Behavior

AwsSdkCall should have a retry configuration, avoid changing IAM roles, or another solution to avoid needing to manually retry the CFN deployment.

Current Behavior

An AccessDenied error is raised, causing the stack deployment to fail.

Reproduction Steps

const copyObject: custom.AwsSdkCall = {
    action: 'copyObject',
    service: 'S3',
    physicalResourceId: custom.PhysicalResourceId.of(`myResourceId`),
    parameters: {
        Bucket: props.s3Bucket.bucketName,
        CopySource: "/source.zip",
        Key: this.publishedUri
    }
};

const copy = new custom.AwsCustomResource(this, `CopyToKnownLocation`, {
    role: props.s3UsageRole,
    // Policy isn't optional, even though I'm giving it a specific role
    policy: {
        statements: [
            new PolicyStatement({
                effect: Effect.ALLOW,
                actions: [
                    's3:Get*',
                    's3:Put*',
                    's3:Copy*'
                ],
                resources: [
                    props.s3Bucket.bucketArn,
                    Fn.join("", [props.s3Bucket.bucketArn, '/*'])
                ]
            })
        ]
    },
    onCreate: copyObject,
    onUpdate: copyObject,
    installLatestAwsSdk: false,
});

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

1.179.0

Framework Version

No response

Node.js Version

14

OS

Linux

Language

TypeScript

Language Version

No response

Other information

No response

peterwoodworth commented 1 year ago

I'm curious if this is limited to China regions.

> I'm seeing failures quite often due to access denied which I believe is because AwsSdkCall keeps making/changing the IAM role and IAM is perhaps caching a bit and causing the access denied problem because it will work on retry.

This behavior is pretty strange to me. To be clear, you are attempting deployment once, it fails, then you retry a second time without changing anything and it succeeds? All of the role/policy modifications should be complete by the time the custom resource starts running, and someone correct me if I'm wrong, but I don't think IAM caches at all. Can you verify that, at the time of failure, your role/policy have been completely created and are correctly populated to give you permissions?

MikeDombo commented 1 year ago

Thanks Peter,

To clarify, the China problem was a timeout when installing the AWS SDK. The access denied error is happening in all regions now that I've disabled installing the SDK.

I looked at the dependencies of the CFN node of the custom resource and it does have a dependency on the IAM policy, so the ordering should be right.

So yeah, very weird, but it does work on retry without any changes.
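For anyone who wants to repeat the dependency check described above, here is a minimal sketch that inspects a synthesized CloudFormation template (for example, the JSON produced by cdk synth) and lists the IAM policies a resource explicitly depends on. The helper name policyDependencies, the logical IDs, and the Custom::AWS resource type in the sample are assumptions for illustration, not taken from the actual stack.

```typescript
// Minimal shape of a CloudFormation template: only the fields inspected here.
interface CfnResource {
  Type: string;
  DependsOn?: string | string[]; // CloudFormation accepts a single ID or a list
}

interface CfnTemplate {
  Resources: Record<string, CfnResource>;
}

// Hypothetical helper: return the logical IDs of the IAM policies that the
// given resource lists in its DependsOn attribute.
function policyDependencies(template: CfnTemplate, logicalId: string): string[] {
  const resource = template.Resources[logicalId];
  if (!resource) {
    throw new Error(`No resource named ${logicalId}`);
  }
  const raw = resource.DependsOn ?? [];
  const deps = Array.isArray(raw) ? raw : [raw];
  return deps.filter((id) => template.Resources[id]?.Type === "AWS::IAM::Policy");
}

// Sample template with made-up logical IDs.
const sampleTemplate: CfnTemplate = {
  Resources: {
    CopyPolicy: { Type: "AWS::IAM::Policy" },
    CopyToKnownLocation: { Type: "Custom::AWS", DependsOn: ["CopyPolicy"] },
  },
};

const found = policyDependencies(sampleTemplate, "CopyToKnownLocation");
```

A non-empty result only confirms that CloudFormation orders the policy before the custom resource. It says nothing about whether the permissions have propagated by the time the Lambda runs.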

rix0rrr commented 1 year ago

This is a P1. The dependency should be there.

rix0rrr commented 1 year ago

Misclassified this, because I misread the following sentence:

> I looked at the dependencies of the CFN node of the custom resource and it does have a dependency on the IAM policy, so the ordering should be right.

I misread that as saying the dependency was missing, but turns out it's there so we're not doing anything wrong (?).

It would be helpful if you could investigate some more to make this more actionable for us.

MikeDombo commented 1 year ago

It happens in all regions.

BwL1289 commented 1 year ago

I believe I am also experiencing this, or something related. Happy to contribute as much info as I can. In our particular case, deployment hangs when install_latest_aws_sdk is True. It is unclear whether it's a permission issue or a timeout issue. Logs and CloudTrail events are unavailable.

MikeDombo commented 1 year ago

IAM is eventually consistent according to their docs: https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_general.html#troubleshoot_general_eventual-consistency, so really even though IAM responds and says the role/policy is done, the user of the role may still need to retry.

MikeDombo commented 1 year ago

@peterwoodworth

> someone correct me if I'm wrong, but I don't think IAM caches at all

I don't believe IAM caches, no, but it does take some amount of time for the permissions to become available due to eventual consistency. Given that it works on retry, and it works when using the latest SDK (which takes a little time to download and install), eventual consistency feels like the culprit. Simple retries would also seem to be a reasonable solution to this issue.
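To make the "simple retries" idea concrete, here is a minimal sketch of retrying a call with exponential backoff when it fails with AccessDenied. It is not how AwsCustomResource works today, and retryOnAccessDenied, isAccessDenied, and RetryOptions are hypothetical names. It also assumes the SDK error carries the string AccessDenied in either a code or a name field.

```typescript
// Hypothetical options for the retry helper; not part of aws-cdk or the AWS SDK.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number; // delay before the second attempt
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

// Assumption: an access-denied error exposes "AccessDenied" as `code` or `name`.
function isAccessDenied(err: unknown): boolean {
  if (typeof err !== "object" || err === null) {
    return false;
  }
  const e = err as { code?: unknown; name?: unknown };
  return e.code === "AccessDenied" || e.name === "AccessDenied";
}

// Run `call`, retrying only on AccessDenied, with exponential backoff.
async function retryOnAccessDenied<T>(
  call: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Rethrow other errors, or give up once the attempts are exhausted.
      if (!isAccessDenied(err) || attempt >= opts.maxAttempts) {
        throw err;
      }
      // Backoff sequence: base, 2 x base, 4 x base, ...
      await sleep(opts.baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
}
```

Retrying only on AccessDenied keeps genuine failures, such as a missing source object, from being masked. Capping the number of attempts keeps the custom resource within its Lambda timeout.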