`WaitUntil*` functions don't fail early if AccessDenied

dannyrandall commented 1 year ago

Describe the bug

When using the cloudformation.WaitUntilChangeSetCreateCompleteContext function with an IAM role with an explicit DENY for cloudformation:DescribeChangeSet, I expected that the function would return early (instead of waiting until MaxAttempts) with an error like:

AccessDenied: User: arn:aws:sts::12345:myRole is not authorized to perform: cloudformation:DescribeChangeSet on resource: arn:aws:cloudformation:us-west-2:12345:stack/my-stack/change-set-id with an explicit deny in an identity-based policy

However, an AccessDenied error doesn't match any of the defined Acceptors for that API. Because of this, the function loops up to the defined MaxAttempts, with Delay between each call before the error is finally returned.

Usage example here: https://github.com/aws/copilot-cli/blob/dc6d9535b9fb1cd5cc87975f158b980b85fdd88f/internal/pkg/aws/cloudformation/changeset.go#L119-L124

Expected Behavior

I expected that an AccessDenied error would be returned to the caller immediately upon receiving that response.

Current Behavior

The request.Waiter makes the request continuously for an hour (by default), constantly getting the same error.

Reproduction Steps

Using credentials for an IAM role with the policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "cloudformation:DescribeChangeSet"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Code:

func main() {
    cfn := cloudformation.New(session.Must(session.NewSession(&aws.Config{
        Region: aws.String("us-west-2"),
    })))

    err := cfn.WaitUntilChangeSetCreateCompleteWithContext(ctx, &cloudformation.DescribeChangeSetInput{
        ChangeSetName: aws.String("arn:aws:cloudformation:us-west-2:12345:stack/stack-name/change-set-id"), // this is invalid input (not a change set), but irrelvant to the example
    })
    log.Printf("WaitUtilChangeSetCreateComplete error: %s", err)
}

Possible Solution

There should be an acceptor to catch 403 responses defined here https://github.com/aws/aws-sdk-go/blob/eee2b1105ebe299ad8f002c227c61b4c1fb9aa57/models/apis/cloudformation/2010-05-15/waiters-2.json#L259-L277.

For example, testing with this worked for me:

{
  "matcher": "status",
  "expected": 403,
  "state": "failure"
},

Additional Information/Context

After glancing at other WaitUntil* functions, I see this extends to other functions as well.

SDK version used

v1

Environment details (Version of Go (`go version`)? OS name and version, etc.)

go version go1.21.1 darwin/amd64

RanVaknin commented 1 year ago

Hi @dannyrandall,

The C2J API model you linked indeed specifies all the data the SDK's code generator needs for creating conditions and values for each waiter.

Since all of the AWS SDKs code is auto-generated from these API model files, the SDK team can't make direct changes to them. This data is defined upstream by the respective AWS services themselves (in this case, CloudFormation) and not on the SDK level.

Regarding terminating the waiter on a 403 error, there are some counterarguments to consider. Although 403 errors are generally not retryable, in distributed systems there can be a valid use case to force retries on 403s. Mainly surrounding propagation delays during IAM role creation or credential acquisition. These delays can lead to transient conditions where the system might actually benefit from retrying a 403 error.

For example, in an EKS environment, a pod might be in the process of scaling down or restarting. During this transient state, the SDK may not have the correct data to make a successful request, resulting in a 403. In such cases, automatically terminating the waiter on a 403 error could be counterproductive, as the error may resolve itself once the pod is back to a working state.

FWIW, I think you are making a reasonable argument, however this should be raised with the Cloudformation service team rather than the SDK team. You can use your AWS console to file a support ticket and raise this issue internally with them.

In the meantime your best option would be to handwrite your own waiter (you can use the generated one as a template) and add the logic to abort on 403s.

Thanks again, Ran~

github-actions[bot] commented 1 year ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

dannyrandall commented 1 year ago

@RanVaknin thanks for the response! Makes sense that since this logic is shared across all the SDKs, the issue belongs with the CloudFormation team. I'll open a ticket from the AWS Console with them!😊

aws / aws-sdk-go