Open · naseemkullah opened 3 years ago
After not encountering this problem for a while, we're now running into it again as well. Luckily it happened in our dev account, but I'm hesitant to deploy to production.
We are also experiencing this issue. The Lambda successfully uploads all the files to S3, but it never completes and ends in a timeout error. The other strange thing: for the latest Lambda invocation that timed out, I don't see any cache invalidation in the CloudFront distribution.
These are the Lambda logs:
[INFO] 2023-12-11T19:31:34.655Z 181e927e-a970-43bc-a974-d88e6761c4cc | aws s3 sync /tmp/tmp9zlftjbn/contents s3://notebooks/
Completed 8.6 KiB/~9.5 KiB (70.6 KiB/s) with ~3 file(s) remaining (calculating...)
upload: ../../tmp/tmp9zlftjbn/contents/error/403.html to s3://notebooks/error/403.html
Completed 8.6 KiB/~9.5 KiB (70.6 KiB/s) with ~2 file(s) remaining (calculating...)
Completed 9.1 KiB/~9.5 KiB (14.2 KiB/s) with ~2 file(s) remaining (calculating...)
upload: ../../tmp/tmp9zlftjbn/contents/index.html to s3://notebooks/index.html
Completed 9.1 KiB/~9.5 KiB (14.2 KiB/s) with ~1 file(s) remaining (calculating...)
Completed 9.5 KiB/~9.5 KiB (5.2 KiB/s) with ~1 file(s) remaining (calculating...)
upload: ../../tmp/tmp9zlftjbn/contents/error/404.html to s3://notebooks/error/404.html
Completed 9.5 KiB/~9.5 KiB (5.2 KiB/s) with ~0 file(s) remaining (calculating...)
2023-12-11T19:45:36.537Z 181e927e-a970-43bc-a974-d88e6761c4cc Task timed out after 900.17 seconds
END RequestId: 181e927e-a970-43bc-a974-d88e6761c4cc
REPORT RequestId: 181e927e-a970-43bc-a974-d88e6761c4cc Duration: 900171.11 ms Billed Duration: 900000 ms
Just hit this. CDK deployment. RequestId: 552880ea-f37b-4b8b-8cc8-3772e52e4cd3
Still happening in 2024.... Not sure why I'm using Cloudfront at this point...
Same here. What can you guys propose to prevent such issues in production pipelines?
Add retry logic.
Hey man. What do you mean? Where? The CloudFormation deployment fails with the state "UPDATE_ROLLBACK_FAILED". All we can do is wait and then do "Continue update rollback" in the UI. (I guess there must be an API command for that; see the sketch below.)
Why doesn't AWS add the retry? We are using the standard CDK library; its core building blocks must just work, right?
We enjoy having the cloud and application code in the same language in the same monorepo (TypeScript + CDK + Nx in our case), but problems like this make us think about migrating to Terraform.
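For what it's worth, there is an API for that step. Below is a minimal sketch, assuming boto3 and a placeholder stack name; the AWS CLI equivalent is aws cloudformation continue-update-rollback.

# Hypothetical helper, not part of CDK: resume a stack stuck in
# UPDATE_ROLLBACK_FAILED without clicking through the console.
import boto3

cfn = boto3.client("cloudformation")

# Same action as the console's "Continue update rollback" button.
cfn.continue_update_rollback(StackName="my-stack")  # "my-stack" is a placeholder

# Check where the rollback got to afterwards.
status = cfn.describe_stacks(StackName="my-stack")["Stacks"][0]["StackStatus"]
print(status)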
My team chose to swap back to Terraform because CloudFormation is... not great. But we would basically run the CDK stack twice if it failed the first time (see the sketch below). It wasn't a good experience for us. Mind you, this issue is not unique to CDK; this is an issue with CloudFront at the end of the day.
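A rough sketch of that kind of pipeline-level retry, for anyone who wants a starting point: rerun cdk deploy once if the first attempt fails. It assumes the CDK CLI is available in the build image; the stack name and flags are placeholders.

# Hypothetical CI wrapper: retry "cdk deploy" once on failure.
import subprocess
import sys

CMD = ["npx", "cdk", "deploy", "MyStack", "--require-approval", "never"]  # placeholders

result = None
for attempt in (1, 2):
    result = subprocess.run(CMD)
    if result.returncode == 0:
        break
    print(f"cdk deploy failed on attempt {attempt}")

sys.exit(result.returncode)

Note that this only helps when the failed deploy rolls back cleanly; if the stack ends up in UPDATE_ROLLBACK_FAILED, you still need continue-update-rollback (as sketched earlier in the thread) before a rerun can succeed.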
We no longer experience this issue after increasing the memory limit of the bucket deployment.
new BucketDeployment(this, 'website-deployment', {
  ...config,
  memoryLimit: 2048,
});
The default memory limit is 128 MiB (docs).
I'm pretty sure that's coincidental. First of all, it is VERY random; it can easily be 30-40 deployments between occurrences, and then it suddenly happens multiple times within a few days. Second, the issue seems to be the CloudFront API itself timing out, or taking so long that the bucket deployment times out.
The only pattern I've seen is that it seems to happen more often if we deploy at the end of the day (CET).
Somewhere in the last 2 years the devs said this was an issue internal to CloudFront and that they were working with that team on it. That was a long time ago.
Abandon all hope ye who enter here.
Still facing this issue in June 2024.
I have also seen the issue. I see 2 completed invalidations: the 1st at the time the deploy starts and the 2nd at the time of the rollback (successful).
I've been getting notifications for this issue since 2021; I wouldn't hold your breath. Perhaps implement retry logic.
https://github.com/aws/aws-cdk/blob/beb01b549abc5a5c825e951649786589d2150a72/packages/%40aws-cdk/aws-s3-deployment/lib/lambda/index.py#L150-L163
I've come across a deployment where CloudFront was invalidated but the Lambda still timed out with
cfn_error: Waiter InvalidationCompleted failed: Max attempts exceeded
~~I suspect a race condition, and that reversing the order of cloudfront.create_invalidation() and cloudfront.get_waiter() would fix this race condition.~~ Edit: the proposed fix of reversing create_invalidation() and get_waiter() is invalid, see https://github.com/aws/aws-cdk/issues/15891#issuecomment-898413309
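For anyone patching around it, here is a minimal boto3 sketch of the pattern under discussion: create the invalidation, then wait on it with a wider waiter budget and one retry on WaiterError. This is not the aws-s3-deployment handler's actual code, and the distribution ID is a placeholder.

# Hypothetical sketch, not the handler from index.py above.
import uuid

import boto3
from botocore.exceptions import WaiterError

DISTRIBUTION_ID = "E123EXAMPLE"  # placeholder

cloudfront = boto3.client("cloudfront")

# Start the invalidation and remember its id.
invalidation = cloudfront.create_invalidation(
    DistributionId=DISTRIBUTION_ID,
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},
        "CallerReference": str(uuid.uuid4()),
    },
)
invalidation_id = invalidation["Invalidation"]["Id"]

# Wait for completion with a wider budget than the waiter's defaults,
# and retry the wait once if it still gives up.
waiter = cloudfront.get_waiter("invalidation_completed")
for attempt in range(2):
    try:
        waiter.wait(
            DistributionId=DISTRIBUTION_ID,
            Id=invalidation_id,
            WaiterConfig={"Delay": 20, "MaxAttempts": 60},
        )
        break
    except WaiterError:
        if attempt == 1:
            raise

Inside a Lambda-backed custom resource the function timeout still caps the total wait (the logs above show the 900-second limit being hit), which is why people in this thread fall back to stack-level retries instead.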