aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS/Fargate][bug]: Unable to start new tasks on platform 1.4.0 suddenly #1425

Open rpurdon-nf opened 3 years ago

rpurdon-nf commented 3 years ago

Tell us about your request

We have been running a few ECS services on Fargate using platform version 1.4.0 for the past ~3 months with no problems. Since yesterday, however, one of our ECS services has been unable to start any new tasks on platform 1.4.0. Each task throws the error below when trying to reach the RUNNING state.

(CannotCreateVolumeError: unable to copy contents to volume on host: containerd: failed to get image reference: context deadline exceeded)

NOTE: The service in question does not make use of any volume sharing between containers, and none of our Dockerfiles declare VOLUME paths either.
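For anyone wanting to double-check the same thing on their own service, here is a minimal sketch that summarizes the explicit volume configuration in a task definition. It assumes the task definition is available as a plain JSON object (for example the `taskDefinition` object returned by `describe-task-definition`); the `volume_usage` helper and the example data are hypothetical and are not taken from the attached taskdefinition.txt.

```python
import json


def volume_usage(task_definition: dict) -> dict:
    """Summarize explicit volume configuration in an ECS task definition dict."""
    containers = task_definition.get("containerDefinitions", [])
    return {
        # Task-level volumes that containers could mount.
        "volumes": [v.get("name") for v in task_definition.get("volumes", [])],
        # Containers that mount a task-level volume.
        "mountPoints": {
            c.get("name"): c["mountPoints"] for c in containers if c.get("mountPoints")
        },
        # Containers that import volumes from another container.
        "volumesFrom": {
            c.get("name"): c["volumesFrom"] for c in containers if c.get("volumesFrom")
        },
    }


# Hypothetical task definition with no explicit volume configuration.
example = {
    "family": "example-service",
    "containerDefinitions": [
        {"name": "app", "image": "example/app:latest", "mountPoints": [], "volumesFrom": []}
    ],
    "volumes": [],
}

print(json.dumps(volume_usage(example), indent=2))
```

An empty result only covers the task definition itself. It says nothing about VOLUME directives baked into the container images.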

Which service(s) is this request for? Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I am trying to get our tasks to run on platform 1.4.0 again, but they keep crashing on startup with the error described above. I have moved the affected service back to platform version 1.3.0, and that has allowed us to start the tasks again.

I updated the version to 1.4.0 again, but the tasks just keep failing and restarting continuously. One task was able to start on 1.4.0, but none of the others could. So I am currently running a combination of 2 tasks on 1.3.0 and 1 task on 1.4.0.

I have tested the same scenario on our staging AWS account, which is a replica of our production environment, and I cannot reproduce the error. I can switch between platform versions as much as I like on our staging account, with new tasks starting successfully each time. This makes me believe that it is either a bug in a recent release or that our service is running on faulty infrastructure.

Are you currently working around this issue? Yes, we are running our production service on platform 1.3.0
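As a rough sketch of this workaround, the platform version can be pinned through the ECS `UpdateService` API. The example below assumes boto3 is installed and credentials are configured; the cluster and service names are placeholders, and the `pin_platform_version_args` and `pin_platform_version` helpers are hypothetical names introduced only for illustration.

```python
def pin_platform_version_args(cluster: str, service: str, platform_version: str = "1.3.0") -> dict:
    """Build keyword arguments for the ECS UpdateService call."""
    return {
        "cluster": cluster,
        "service": service,
        "platformVersion": platform_version,
        # Ask ECS to start a new deployment so replacement tasks use the pinned version.
        "forceNewDeployment": True,
    }


def pin_platform_version(cluster: str, service: str, platform_version: str = "1.3.0") -> dict:
    """Apply the pin. Requires boto3 and AWS credentials (not exercised here)."""
    import boto3  # third-party dependency, imported lazily

    ecs = boto3.client("ecs")
    return ecs.update_service(**pin_platform_version_args(cluster, service, platform_version))


# Placeholder names; substitute the real cluster and service.
print(pin_platform_version_args("my-cluster", "my-service"))
```

Calling `pin_platform_version("my-cluster", "my-service", "1.4.0")` would be one way to retry the newer platform version later.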

Additional context I have attached the task definition and some screenshots of the currently running and stopped tasks.

Attachments

taskdefinition.txt

Screenshot 2021-07-01 at 13 54 17

Screenshot 2021-07-01 at 13 54 47

henriquesantanati commented 1 year ago

Hello @rpurdon-nf ,

Thanks for raising this and for all the detailed information.

Could you please check whether this is related to the VOLUME directive inside the Dockerfile? There is another issue that explains this further.
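A quick way to run that check is to scan the Dockerfile text for VOLUME instructions. The snippet below is only a sketch: `find_volume_directives` is a hypothetical helper, the example Dockerfile is made up, and the scan does not handle line continuations. It also cannot see VOLUME declarations inherited from a base image; those show up under `Config.Volumes` in the output of `docker image inspect`.

```python
import re
from typing import List, Tuple

# Matches a VOLUME instruction at the start of a line (Dockerfile instructions are case-insensitive).
VOLUME_RE = re.compile(r"^\s*VOLUME\s+(.+)$", re.IGNORECASE)


def find_volume_directives(dockerfile_text: str) -> List[Tuple[int, str]]:
    """Return (line number, argument) for each VOLUME instruction found."""
    hits = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = VOLUME_RE.match(line)
        if match:
            hits.append((lineno, match.group(1).strip()))
    return hits


# Made-up Dockerfile for illustration only.
example = """\
FROM example/base:1.0
# VOLUME /ignored/comment
COPY app /app
VOLUME /var/lib/data
"""

print(find_volume_directives(example))  # [(4, '/var/lib/data')]
```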