eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

Race condition when updating log retention policy #4454

Open artem-nefedov opened 2 years ago

artem-nefedov commented 2 years ago

While using the new "logRetentionInDays" field added in eksctl 0.73.0, we intermittently see the following error during cluster creation:

[✔]  configured CloudWatch logging for cluster "test-cluster" in "us-west-2" (enabled types: audit, authenticator, scheduler & disabled types: api, controllerManager)
[!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
[ℹ]  to cleanup resources, run 'eksctl delete cluster --region=us-west-2 --name=test-cluster'
[✖]  error updating log retention settings: ResourceNotFoundException: The specified log group does not exist.

The reproduction rate isn't very high (around 10%), but it is high enough to be a problem. There seems to be a race condition between enabling logging and setting the retention policy.

cPu1 commented 2 years ago

Interesting find. We do wait for the UpdateClusterConfig operation to complete before issuing a call to logs:PutRetentionPolicy, but it looks like that does not guarantee the log group for the control plane has been created yet. We'll investigate this and get back to you soon.
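
One possible mitigation for the race described above is to retry the retention call while the log group is still missing. The sketch below is not eksctl's code (eksctl is written in Go); it assumes a boto3-style CloudWatch Logs client, and the function name `put_retention_with_retry`, the attempt count, and the delay are hypothetical choices for illustration. The log group name follows the `/aws/eks/<cluster-name>/cluster` pattern that EKS uses for control plane logs.

```python
import time


def put_retention_with_retry(logs_client, cluster_name, retention_days,
                             attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Set the retention policy, retrying while the log group does not exist.

    logs_client is assumed to behave like boto3.client("logs").
    Returns the number of attempts that were needed.
    """
    log_group = f"/aws/eks/{cluster_name}/cluster"
    not_found = logs_client.exceptions.ResourceNotFoundException
    for attempt in range(1, attempts + 1):
        try:
            logs_client.put_retention_policy(
                logGroupName=log_group,
                retentionInDays=retention_days,
            )
            return attempt
        except not_found:
            # The log group may not have been created yet; give up only
            # after the last attempt, otherwise wait and try again.
            if attempt == attempts:
                raise
            sleep(delay_seconds)
```

With a real client this could be called as `put_retention_with_retry(boto3.client("logs", region_name="us-west-2"), "test-cluster", 7)`. The same call can also be run by hand after a failed `eksctl create cluster` as a temporary workaround, once the log group has appeared.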

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

cPu1 commented 2 years ago

Not stale, this still needs to be resolved.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Himangini commented 2 years ago

Not stale, this still needs to be resolved.

@cPu1 what is required to resolve this? Are we waiting on anything from AWS?

hgrant-ebsco commented 2 years ago

Running into this problem as well with v0.95.0.