aws / aws-cdk

aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error #15891

Open · naseemkullah opened this issue 3 years ago

naseemkullah commented 3 years ago

https://github.com/aws/aws-cdk/blob/beb01b549abc5a5c825e951649786589d2150a72/packages/%40aws-cdk/aws-s3-deployment/lib/lambda/index.py#L150-L163

I've come across a deployment where CloudFront was invalidated but the lambda timed out with cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded. ~I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.~

edit: proposed fix of reversing create_invalidation() and get_waiter() is invalid, see https://github.com/aws/aws-cdk/issues/15891#issuecomment-898413309

otaviomacedo commented 3 years ago

Hi, @naseemkullah

Thanks for reporting this and suggesting a solution.

I presume your hypothesis is that, in some cases, the invalidation happens very fast and the waiter gets created after the invalidation has completed, causing it to wait until the timeout is reached. Is that fair?

Also, how easily can you reproduce this issue? Race conditions are usually tricky to test. I would like to get some assurance that the swap will actually fix the issue.

naseemkullah commented 3 years ago

Hi @otaviomacedo,

> I presume your hypothesis is that, in some cases, the invalidation happens very fast and the waiter gets created after the invalidation has completed, causing it to wait until the timeout is reached. Is that fair?

Yep, that's right.

> Also, how easily can you reproduce this issue? Race conditions are usually tricky to test. I would like to get some assurance that the swap will actually fix the issue.

Not easily 😞. In fact, it is an intermittent issue that I've observed at the end of our CI/CD pipeline (during deployment) every now and then (rough estimate: 1 in 50). I'm afraid I cannot provide more assurance than the reasoning above. If you don't see any potential issues arising from reversing the order that I may not have thought of, I'll be happy to submit this potential fix. Cheers!

otaviomacedo commented 3 years ago

I think the risk involved in this change is quite low. Please submit the PR and I'll be happy to review it.

naseemkullah commented 3 years ago

After reading up on the waiter, it appears that it uses a polling mechanism; furthermore, the ID of the invalidation request is passed into it, so all seems well on that front (roughly as sketched below).
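
For context, a minimal sketch of the two boto3 calls under discussion (not the exact handler code; `distribution_id` and `paths` are placeholders):

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

def invalidate_and_wait(distribution_id, paths):
    # create_invalidation returns the new invalidation's Id, which the waiter
    # then polls for, so the waiter cannot "miss" a fast invalidation.
    response = cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time()),
        },
    )
    invalidation_id = response["Invalidation"]["Id"]

    # InvalidationCompleted is a polling waiter: it calls GetInvalidation
    # repeatedly until the status is "Completed" or it runs out of attempts.
    waiter = cloudfront.get_waiter("invalidation_completed")
    waiter.wait(DistributionId=distribution_id, Id=invalidation_id)
```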

Not sure why I see these timeouts occasionally 👻 .... but my hypothesis no longer holds, closing. Thanks!

edit: re-opened since this is still an issue

github-actions[bot] commented 3 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

peterwoodworth commented 3 years ago

Reopening because additional customers have been impacted by this issue. @naseemkullah are you still running into this issue?

From another customer experiencing the issue:

> Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded
>
> this issue is intermittent and when we redeploy it works. Our pipelines are automated and we deploy 3-5 times every day in production. When our stack fails due to this error, CloudFront is unable to roll back, which creates high-severity issues in prod and there is downtime until we redeploy the pipeline again. This error happens during the invalidation part, but somehow CloudFront is not able to get the files from the S3 origin when this error occurs. We have enabled versioning on the S3 bucket so that CloudFront is able to serve the older version in case of rollback, but it's still unable to fetch files until we redeploy.

customer's code:

  new s3deploy.BucketDeployment(this, 'DeployWithInvalidation', {
      sources: [s3deploy.Source.asset(`../packages/dist`)],
      destinationBucket: bucket,
      distribution,
      distributionPaths: [`/*`],
      retainOnDelete: false,
      prune: false,
    });

This deploys the files in s3 bucket and creates a cloudfront invalidation which is when the stack fails on the waiter error.

naseemkullah commented 3 years ago

@peterwoodworth yes occasionally! I was a little quick to close it once my proposed solution fell through, thanks for reopening.

otaviomacedo commented 3 years ago

In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it's not clear to me what else we can do. In any case, contributions are welcome!
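
For what it's worth, here is a sketch of what raising the wait might look like, assuming the handler uses boto3's `invalidation_completed` waiter (whose defaults of a 20-second delay and 30 attempts correspond to the 10 minutes seen in the error); `distribution_id` and `invalidation_id` are placeholders:

```python
import boto3

# Sketch: stretch the waiter closer to Lambda's 15-minute cap by overriding
# the default WaiterConfig (20 s delay x 30 attempts = 10 min by default).
cloudfront = boto3.client("cloudfront")
waiter = cloudfront.get_waiter("invalidation_completed")
waiter.wait(
    DistributionId=distribution_id,  # placeholder: the distribution being invalidated
    Id=invalidation_id,              # placeholder: Id returned by create_invalidation
    WaiterConfig={
        "Delay": 20,        # seconds between GetInvalidation polls
        "MaxAttempts": 40,  # ~13.3 minutes total, leaving headroom for the rest of the handler
    },
)
```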

naseemkullah commented 3 years ago

> In this case, the most plausible hypothesis is that CloudFront is actually taking longer than 10 min to invalidate the files in some cases. We can try to reduce the chance of this happening by increasing the waiting time, but Lambda has a maximum timeout of 15 min. Beyond that, it's not clear to me what else we can do. In any case, contributions are welcome!

It has happened twice in recent days; next time it occurs I will try to confirm this. IIRC, the first time this happened I checked and saw that the invalidation had occurred almost immediately, yet the waiter did not see that (which is why I thought it might be a race condition). Will confirm though!

quixoticmonk commented 3 years ago

Noticed the same with a client I support over the last few weeks, and it makes us rethink using the BucketDeployment construct overall. I will check any new occurrences and confirm the actual behavior of CloudFront in the background.

quixoticmonk commented 3 years ago

In my case, the deployment kicked off two invalidations; both were in progress for a long time and eventually timed out. [Screenshot: Screen Shot 2021-09-08 at 9.15.14 AM]

naseemkullah commented 3 years ago

Confirming that in my case the invalidation occurs when it should, but the waiter just never gets the memo and fails the deployment after 10 minutes.

sblackstone commented 3 years ago

I can confirm this issue occurs with some regularity for me too...

I have a script that deploys the same stack to 29 different accounts; with a deploy I just did, I had 3 of 29 fail with Waiter InvalidationCompleted failed:

naseemkullah commented 3 years ago

Thought I would mention I've had this happen twice in a row today, once upon stack update and once upon stack rollback, resulting in an UPDATE_ROLLBACK_FAILED state requiring a manual Continue update rollback.

otaviomacedo commented 2 years ago

I raised this issue internally with the CloudFront team. I'll keep you guys updated in this conversation.

jgoux commented 2 years ago

We have this issue as well, multiple times per day (we deploy preview environments so we have a lot of deployments :)).

We also have this variant sometimes:

123/101 |8:51:05 AM | CREATE_FAILED        | Custom::CDKBucketDeployment                     | Cloudfront/BucketDeployment/CustomResource/Default (CloudfrontBucketDeploymentCustomResource2C596BD7) Received response status [FAILED] from custom resource. Message returned: An error occurred (ServiceUnavailable) when calling the CreateInvalidation operation (reached max retries: 4): CloudFront encountered an internal error. Please try again. (RequestId: f93664a0-0a56-4420-aa61-79ea5ed293b2)

iamjaekim commented 2 years ago

We were also impacted by this error. One interesting thing I found in the CloudFormation log was that, when this error started, it triggered a new resource creation, which wasn't the case according to older log sets.

What it normally looked like:

2021-10-20 11:22:38 UTC-0400 | Stack Name | UPDATE_COMPLETE | -
2021-10-20 11:22:37 UTC-0400 | Stack Name | UPDATE_COMPLETE_CLEANUP_IN_PROGRESS
2021-10-20 11:22:34 UTC-0400 | Some ID | UPDATE_COMPLETE | -
2021-10-20 11:20:48 UTC-0400 | Some ID | UPDATE_IN_PROGRESS | -
2021-10-20 11:20:05 UTC-0400 | Stack Name | UPDATE_IN_PROGRESS | User Initiated

When the error occurred:

2021-10-20 12:35:35 UTC-0400 | Stack Name | UPDATE_ROLLBACK_COMPLETE | -
2021-10-20 12:35:34 UTC-0400 | Some ID | DELETE_COMPLETE | -
2021-10-20 12:27:51 UTC-0400 | Some ID | DELETE_IN_PROGRESS | -
2021-10-20 12:24:48 UTC-0400 | Some ID | DELETE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: Some ID)
2021-10-20 12:14:38 UTC-0400 | Some ID | DELETE_IN_PROGRESS | -
2021-10-20 12:11:35 UTC-0400 | Some ID | DELETE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: Some ID)
2021-10-20 12:01:23 UTC-0400 | Some ID | DELETE_IN_PROGRESS | -
2021-10-20 12:01:21 UTC-0400 | Stack Name | UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS | -
2021-10-20 12:01:20 UTC-0400 | Some ID | UPDATE_COMPLETE | -
2021-10-20 12:01:06 UTC-0400 | Stack Name | UPDATE_ROLLBACK_IN_PROGRESS | The following resource(s) failed to update: [Some ID].
2021-10-20 12:01:04 UTC-0400 | Some ID | UPDATE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: Some ID)
2021-10-20 12:01:04 UTC-0400 | Some ID | UPDATE_IN_PROGRESS | Requested update required the provider to create a new physical resource
2021-10-20 11:50:22 UTC-0400 | Some ID | UPDATE_IN_PROGRESS | -
2021-10-20 11:49:53 UTC-0400 | Stack Name | UPDATE_IN_PROGRESS | User Initiated

Regarding "2021-10-20 12:01:04 UTC-0400 | Some ID | UPDATE_IN_PROGRESS | Requested update required the provider to create a new physical resource": there was no infrastructure change that would've triggered new resource creation, yet this showed up on all deployments triggered around this time frame.

otaviomacedo commented 2 years ago

From the CloudFront team:

> CreateInvalidation API suffers from a high fault rate during the daily traffic peaks. It will return faults for up to 50% of the requests. It is primarily due to the limited capacity of the API.

and

> We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

hugomallet commented 2 years ago

Hi,

On my side, the S3 deployment is very slow (often 10 minutes or more) for a fairly lightweight static website, and sometimes it fails as described here. The invalidation seems to be the cause as well.

So firstly, is there any news from the AWS team about planned improvements to the invalidation process? (That would be great 😊)

And secondly, why does the s3-deployment module wait for the invalidation to complete? Is it really necessary?

adeelamin15 commented 2 years ago

Just FYI, I am still getting this issue


Failed resources:

    my-stack | 5:54:23 AM | UPDATE_FAILED        | Custom::CDKBucketDeployment                     | my-frontend-DeployWebsite/CustomResource/Default (myfrontendDeployWebsiteCustomResourceB5AC9AF1) Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: be06c4ff-4c2f-42e5-bae6-7c5c42360b5f)
    my-stack | 6:05:24 AM | UPDATE_FAILED        | Custom::CDKBucketDeployment                     | my-portal-DeployWebsite/CustomResource/Default (myportalDeployWebsiteCustomResource9B1D41C9) Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: 192c1bbe-385e-47f9-8beb-170de2c1c1df)

crazyoptimist commented 2 years ago

I experienced the same issue today in our GHA workflow which uses this command:

    aws cloudfront create-invalidation --distribution-id ${{ secrets.AWS_CLOUDFRONT_ID_STAGING }} --paths /\*

Re-ran the failed workflow and it passed. But I'm not sure why or when this happens. The AWS status page tells me that the CloudFront service is healthy. :-(

hlin-neo4j commented 2 years ago

I'm getting the same issue. I'm also seeing it when trying from the browser UI (AWS Management Console).

pdrprz commented 2 years ago

Also experienced this today and had some downtime because of it. Are there any updates on this? Or possible workarounds? Thanks.

mcramer-billgo commented 2 years ago

> Also experienced this today and had some downtime because of it. Are there any updates on this? Or possible workarounds? Thanks.

Same here.

guptalav commented 2 years ago

I had this same issue recently in our pipeline.

ES-Six commented 2 years ago

Just got this error right now: An error occurred (ServiceUnavailable) when calling the CreateInvalidation operation (reached max retries: 4): CloudFront encountered an internal error. Please try again.

This happens randomly, but it's rare (about once per month).

I'm using the latest version of the AWS CLI.

emmanuelnk commented 1 year ago

We are also experiencing this issue intermittently with our cloudfront invalidations (once every two weeks or so) 😞

benjaminpottier commented 1 year ago

Is it possible to re-open this issue? We're experiencing this problem as well.

edwardofclt commented 1 year ago

@benjaminpottier this isn't a CDK problem. We're experiencing a similar issue using the Golang SDK...

    operation error CloudFront: CreateInvalidation, exceeded maximum number of attempts, 3, https response error StatusCode: 503

benjaminpottier commented 1 year ago

> @benjaminpottier this isn't a CDK problem. We're experiencing a similar issue using the Golang SDK...
>
> operation error CloudFront: CreateInvalidation, exceeded maximum number of attempts, 3, https response error StatusCode: 503

Sorry, I realize that now. For anyone else: we increased the AWS_MAX_ATTEMPTS environment variable's value and haven't seen the issue since, though I don't know for sure that it was the solution.
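
For reference, AWS_MAX_ATTEMPTS / AWS_RETRY_MODE feed the AWS SDK retry settings; a minimal boto3 sketch of the in-code equivalent (not taken from the CDK handler):

```python
import boto3
from botocore.config import Config

# In-code equivalent of AWS_MAX_ATTEMPTS=10 / AWS_RETRY_MODE=standard:
# retry throttled or 5xx CreateInvalidation calls up to 10 times with backoff.
cloudfront = boto3.client(
    "cloudfront",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)
```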

edwardofclt commented 1 year ago

To quote AWS support:

> Some of our customers, when attempting to invalidate content, are experiencing errors in the form of "Rate exceeded" exceptions or API faults with errorCode ServiceUnavailableException from the CreateInvalidation API. End-user requests for content are not affected by this issue, and content from our edge locations continues to be served normally. This is a known issue and we are exploring a longer-term solution. In the interim we recommend customers implement a retry mechanism with exponential backoff.
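
A minimal sketch of that recommendation, wrapping CreateInvalidation in retries with exponential backoff (placeholder names; the retried error codes are assumptions based on the faults reported in this thread):

```python
import time
import boto3
import botocore.exceptions

cloudfront = boto3.client("cloudfront")

def create_invalidation_with_backoff(distribution_id, paths, max_attempts=6):
    """Retry CreateInvalidation with exponential backoff, per AWS support's advice."""
    for attempt in range(max_attempts):
        try:
            return cloudfront.create_invalidation(
                DistributionId=distribution_id,
                InvalidationBatch={
                    "Paths": {"Quantity": len(paths), "Items": paths},
                    "CallerReference": str(time.time()),
                },
            )
        except botocore.exceptions.ClientError as error:
            code = error.response["Error"]["Code"]
            retryable = code in ("ServiceUnavailable", "ServiceUnavailableException", "Throttling")
            if not retryable or attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1, 2, 4, 8, 16 seconds between attempts
```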

LosD commented 1 year ago

@otaviomacedo Did you ever get an update from them? Just ran into this (also once at deploy, once at rollback), and it's a major PITA.

> We have evidence that some requests failed even after six retries, during the peak. We are working on improving this, but there is no quick fix for this and we are expecting it will get better by the end of Q1 2022.

benjaminpottier commented 1 year ago

This issue got worse for us so this is our solution for now:

    const createInvalidation = new sfnTasks.CallAwsService(this, 'CreateInvalidation', {
      service: 'cloudfront',
      action: 'createInvalidation',
      parameters: {
        DistributionId: distribution.distributionId,
        InvalidationBatch: {
          CallerReference: sfn.JsonPath.entirePayload,
          Paths: {
            Items: ['/*'],
            Quantity: 1,
          },
        },
      },
      iamResources: [
        `arn:aws:cloudfront::${Aws.ACCOUNT_ID}:distribution/${distribution.distributionId}`,
      ],
    });

    const createInvalidationStateMachine = new sfn.StateMachine(
      this,
      'CreateInvalidationStateMachine',
      {
        definition: createInvalidation.addRetry({
          errors: ['CloudFront.CloudFrontException'],
          backoffRate: 2,
          interval: Duration.seconds(5),
          maxAttempts: 10,
        }),
      }
    );

    new events.Rule(this, 'DeploymentComplete', {
      eventPattern: {
        source: ['aws.cloudformation'],
        detail: {
          'stack-id': [`${Stack.of(this).stackId}`],
          'status-details': {
            status: ['UPDATE_COMPLETE'],
          },
        },
      },
    }).addTarget(
      new eventsTargets.SfnStateMachine(createInvalidationStateMachine, {
        input: events.RuleTargetInput.fromEventPath('$.id'),
      })
    );

msheiny commented 1 year ago

Can we re-open this issue? It's still a problem with the underlying Lambda even if it's related to another service. What if we provided an option to not fail the custom resource if the invalidation fails?
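
Purely to illustrate that suggestion (no such option exists in BucketDeployment today), the handler's wait step could swallow the waiter timeout behind a hypothetical flag:

```python
import botocore.exceptions

def wait_for_invalidation(cloudfront, distribution_id, invalidation_id, ignore_failure=False):
    # "ignore_failure" is a hypothetical flag that would be passed down from a
    # new BucketDeployment property; it is not an existing option.
    waiter = cloudfront.get_waiter("invalidation_completed")
    try:
        waiter.wait(DistributionId=distribution_id, Id=invalidation_id)
    except botocore.exceptions.WaiterError:
        if not ignore_failure:
            raise
        print("Invalidation did not complete in time; continuing without failing the resource")
```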

ES-Six commented 1 year ago

Note: I've observed that CloudFront invalidations are more likely to fail when there are lots of previous invalidations for the same CloudFront distribution.

These kinds of errors may be due to the CloudFront API internally encountering a timeout when reading all the invalidations.

A solution for AWS might be to increase the retry count or the timeout for these kinds of API calls, or, in the case of the AWS CLI, to retry more than 3 times and wait a number of seconds between retries.

emmapatterson commented 1 year ago

My team are also seeing this error regularly!

Negan1911 commented 1 year ago

Started to see this problem when using s3 bucket deployments with CDK

jkbailey commented 1 year ago

We started seeing this on 4/19/23, and it's still happening today (4/20).

calebwilson706 commented 1 year ago

This is happening to us frequently now as well.

miekassu commented 1 year ago

This is happening more frequently now

costleya commented 1 year ago

Indeed, the cache invalidates on the CloudFront side almost instantly, but the deploy fails and rolls back (the rollback also takes effect immediately on the CloudFront side, and then the rollback itself fails).

nkeysinstil commented 1 year ago

Seeing this now also

leantorres73 commented 1 year ago

Same here...

xli2227 commented 1 year ago

Encountered the same issue; some action log timestamps:

2023-06-09 08:15:11 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_FAILED | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: 3b01a325-6c24-45f0-8f6c-86638f2e282b)
2023-06-09 08:04:38 UTC-0700 | AgenticConsoleawsgammauseast1consolestackbucketdeploymentCustomResource9C0F1745 | UPDATE_IN_PROGRESS | -

It took 10 minutes to fail the CDK stack, and the invalidation was created 1 minute after the failure:

    IEKSZWOI5U3Q6GNNNQMQLJ11WH | Completed | June 9, 2023 at 3:16:20 PM UTC

JonWallsten commented 10 months ago

I've just seen this for the first time today. But in my case the invalidation is actually not complete: [screenshot] It's been going on for 19 minutes now. I have a single origin: an S3 bucket with three files in it. It just failed the deploy for the third time in a row. [screenshot]

hugomallet commented 10 months ago

It seems there's currently a problem in AWS CloudFront; I'm getting the same timeout errors.

nbeag commented 10 months ago

We are also encountering this intermittently in one of our CDK stacks and have noticed it happening more frequently in the last few weeks. When it occurs, the stack initiates a rollback; sometimes this fails (and requires manual intervention) and sometimes the rollback succeeds. Any update/workaround would be appreciated.

abury commented 9 months ago

Started seeing this regularly today as well.

Edit: Seeing this almost every day around the same time? I'm not even sure we can use CloudFront going forward if we can't reliably deploy.

mattiLeBlanc commented 9 months ago

I am suddenly getting the same error in our Staging deployment via Bitbucket:

UPDATE_FAILED (likely root cause) | Received response status [FAILED] from custom resource. Message returned: Waiter InvalidationCompleted failed: Max attempts exceeded (RequestId: dcd7fbdb-d6b7-441f-96f1-08026063b052)

This is a CloudFront deployment. I tried deploying a build from 4 days ago that was fine at the time, and that also fails. It happens at:

[screenshot]

Our Dev and Prod deployments are working fine (different accounts).

This is totally unacceptable, because I think I need to delete my stack (which luckily I can, thanks to our microservice approach), but again, totally unacceptable.