aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-s3-deployment - intermittent cloudfront "Waiter InvalidationCompleted failed" error #15891

Open naseemkullah opened 3 years ago

naseemkullah commented 3 years ago

https://github.com/aws/aws-cdk/blob/beb01b549abc5a5c825e951649786589d2150a72/packages/%40aws-cdk/aws-s3-deployment/lib/lambda/index.py#L150-L163

I've come across a deployment where CloudFront was invalidated but the Lambda timed out with cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded. ~I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.~

Edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid; see https://github.com/aws/aws-cdk/issues/15891#issuecomment-898413309
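For context, the linked handler calls cloudfront.create_invalidation() and then blocks on boto3's invalidation_completed waiter, which polls GetInvalidation until the status is Completed and raises "Max attempts exceeded" once its polling budget runs out (by default the waiter polls every 20 seconds for up to 30 attempts, roughly ten minutes). For illustration only, not the construct's actual (Python) code, here is a rough equivalent of that flow in TypeScript with the AWS SDK for JavaScript v3, using a generous explicit wait budget; the distribution ID and paths are placeholders:

import {
  CloudFrontClient,
  CreateInvalidationCommand,
  waitUntilInvalidationCompleted,
} from '@aws-sdk/client-cloudfront';

const cloudfront = new CloudFrontClient({});

async function invalidateAndWait(distributionId: string, paths: string[]): Promise<void> {
  const { Invalidation } = await cloudfront.send(
    new CreateInvalidationCommand({
      DistributionId: distributionId,
      InvalidationBatch: {
        CallerReference: `deploy-${Date.now()}`, // must be unique per request
        Paths: { Quantity: paths.length, Items: paths },
      },
    }),
  );
  // Poll GetInvalidation until Status is "Completed"; maxWaitTime (seconds)
  // bounds the total time spent polling before the waiter gives up.
  await waitUntilInvalidationCompleted(
    { client: cloudfront, maxWaitTime: 900 },
    { DistributionId: distributionId, Id: Invalidation?.Id },
  );
}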

MrDark commented 11 months ago

After not encountering this problem for a while, we're now hitting it again. Luckily it happened in our dev account, but it makes me hesitant to deploy to production.

alechewitt commented 11 months ago

We are also experiencing this issue. The Lambda successfully uploads all the files to S3; however, it does not complete and results in a timeout error. Another strange thing: for the latest Lambda invocation that timed out, I don't see any cache invalidation in the CloudFront distribution.

These are the Lambda logs:

[INFO]  2023-12-11T19:31:34.655Z    181e927e-a970-43bc-a974-d88e6761c4cc    | aws s3 sync /tmp/tmp9zlftjbn/contents s3://notebooks/
Completed 8.6 KiB/~9.5 KiB (70.6 KiB/s) with ~3 file(s) remaining (calculating...)
upload: ../../tmp/tmp9zlftjbn/contents/error/403.html to s3://notebooks/error/403.html
Completed 8.6 KiB/~9.5 KiB (70.6 KiB/s) with ~2 file(s) remaining (calculating...)
Completed 9.1 KiB/~9.5 KiB (14.2 KiB/s) with ~2 file(s) remaining (calculating...)
upload: ../../tmp/tmp9zlftjbn/contents/index.html to s3://notebooks/index.html
Completed 9.1 KiB/~9.5 KiB (14.2 KiB/s) with ~1 file(s) remaining (calculating...)
Completed 9.5 KiB/~9.5 KiB (5.2 KiB/s) with ~1 file(s) remaining (calculating...) 
upload: ../../tmp/tmp9zlftjbn/contents/error/404.html to s3://notebooks/error/404.html
Completed 9.5 KiB/~9.5 KiB (5.2 KiB/s) with ~0 file(s) remaining (calculating...)
2023-12-11T19:45:36.537Z 181e927e-a970-43bc-a974-d88e6761c4cc Task timed out after 900.17 seconds

END RequestId: 181e927e-a970-43bc-a974-d88e6761c4cc
REPORT RequestId: 181e927e-a970-43bc-a974-d88e6761c4cc  Duration: 900171.11 ms  Billed Duration: 900000 ms
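One way to check whether an invalidation was actually created is to list the distribution's recent invalidations and their statuses. A minimal sketch with the AWS SDK for JavaScript v3; the distribution ID is a placeholder:

import {
  CloudFrontClient,
  ListInvalidationsCommand,
} from '@aws-sdk/client-cloudfront';

const cloudfront = new CloudFrontClient({});

// List invalidations for the distribution and print their statuses.
const { InvalidationList } = await cloudfront.send(
  new ListInvalidationsCommand({ DistributionId: 'E2EXAMPLE123' }),
);
for (const inv of InvalidationList?.Items ?? []) {
  console.log(inv.Id, inv.Status, inv.CreateTime);
}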

richard-collette-precisely commented 8 months ago

Just hit this on a CDK deployment. RequestId: 552880ea-f37b-4b8b-8cc8-3772e52e4cd3

abury commented 8 months ago

Still happening in 2024... Not sure why I'm using CloudFront at this point...

alexandr2110pro commented 8 months ago

Same here. What can you guys propose to prevent such issues in production pipelines?

edwardofclt commented 8 months ago

Add retry logic.

alexandr2110pro commented 8 months ago

> Add retry logic.

Hey man. What do you mean? Where? The CloudFormation deployment fails with the state "UPDATE_ROLLBACK_FAILED". All we can do is wait and then do "continue update rollback" in the UI. (I guess there must be an API command for that; see the sketch below.)

Why doesn't AWS add the retry? We are using the standard CDK library; its core building blocks should just work, right?

We enjoy having the cloud and application code in the same language in the same monorepo (TypeScript + CDK + Nx in our case), but such problems make us think about migrating to Terraform.
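On that last point: there is an API for resuming a stack stuck in UPDATE_ROLLBACK_FAILED, the ContinueUpdateRollback action (aws cloudformation continue-update-rollback in the CLI). A minimal sketch with the AWS SDK for JavaScript v3; the stack name is a placeholder:

import {
  CloudFormationClient,
  ContinueUpdateRollbackCommand,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});

// Same effect as "Continue update rollback" in the console;
// 'my-website-stack' is a placeholder stack name.
await cfn.send(new ContinueUpdateRollbackCommand({ StackName: 'my-website-stack' }));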

edwardofclt commented 8 months ago

My team chose to swap back to Terraform because CloudFormation is... not great. But we would basically run the CDK stack twice if it failed the first time. It wasn't a good experience for us. Mind you, this issue is not unique to CDK; it is an issue with CloudFront at the end of the day.
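Concretely, the retry could be as simple as wrapping the flaky step (the invalidation wait, or a whole deploy invocation in a pipeline script) in a generic backoff helper. A minimal sketch; every name here is illustrative, not part of the CDK API:

// Retry an async step a few times with exponential backoff before giving up.
async function withRetries<T>(step: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await step();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Back off 5 s, 10 s, 20 s, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, 5_000 * 2 ** (attempt - 1)));
    }
  }
}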

jkbailey commented 8 months ago

We no longer experience this issue after increasing the memory limit of the bucket deployment.

new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048
})

The default memory limit is 128 MiB (docs).
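For completeness, the invalidation (and the waiter that times out here) only comes into play when the deployment is wired to a distribution. A fuller sketch of the configuration; websiteBucket and distribution are assumed to be defined elsewhere in the stack:

import { BucketDeployment, Source } from 'aws-cdk-lib/aws-s3-deployment';

new BucketDeployment(this, 'website-deployment', {
  sources: [Source.asset('./dist')],
  destinationBucket: websiteBucket,
  // Passing a distribution is what makes the custom resource create a
  // CloudFront invalidation and wait for it to complete.
  distribution,
  distributionPaths: ['/*'],
  memoryLimit: 2048,
});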

LosD commented 8 months ago

> We no longer experience this issue after increasing the memory limit of the bucket deployment.
>
> new BucketDeployment(this, 'website-deployment', {
>   ...config,
>   memoryLimit: 2048
> })
>
> The default memory limit is 128 MiB (docs).

I'm pretty sure that's coincidental. First of all, it is VERY random; it can easily be 30-40 deployments between occurrences, then it suddenly happens multiple times within a few days. Second, the issue seems to be the CloudFront API itself timing out, or taking so long that the BucketDeployment times out.

The only pattern I've seen is that it seems to happen more often if we deploy at the end of the day (CET).

sblackstone commented 8 months ago

> > We no longer experience this issue after increasing the memory limit of the bucket deployment.
> >
> > new BucketDeployment(this, 'website-deployment', {
> >   ...config,
> >   memoryLimit: 2048
> > })
> >
> > The default memory limit is 128 MiB (docs).
>
> I'm pretty sure that's coincidental. First of all, it is VERY random; it can easily be 30-40 deployments between occurrences, then it suddenly happens multiple times within a few days. Second, the issue seems to be the CloudFront API itself timing out, or taking so long that the BucketDeployment times out.
>
> The only pattern I've seen is that it seems to happen more often if we deploy at the end of the day (CET).

Somewhere in the last two years the devs said this was an issue internal to CloudFront and that they were working with that team on it. That was a long time ago.

Abandon all hope, ye who enter here.

pardeepdhingra commented 5 months ago

Still facing this issue in June 2024.

ashellunts commented 2 months ago

I have also seen the issue. I see two completed invalidations: the first at the time the deploy starts and the second at the time of the (successful) rollback.

sblackstone commented 2 months ago

> I have also seen the issue. I see two completed invalidations: the first at the time the deploy starts and the second at the time of the (successful) rollback.

I've been getting notifications for this issue since 2021; I wouldn't hold your breath. Perhaps implement retry logic.