[ECS/Fargate][bug]: Unable to start new tasks on platform 1.4.0 suddenly

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

We have been running a few ECS services on Fargate using platform version 1.4.0 for the past ~3 months with no problems, except that, as of yesterday, one of our ECS services has been unable to start any new tasks on platform 1.4.0. Each task throws the error below when trying to reach the RUNNING state.

(CannotCreateVolumeError: unable to copy contents to volume on host: containerd: failed to get image reference: context deadline exceeded)

Note: the service in question does not make use of any volume sharing between containers, and none of our Dockerfiles declare VOLUME paths either.
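For anyone wanting to double-check the same claim on their own service, here is a minimal sketch that scans a task definition for volume usage. It assumes the input is a dict shaped like the `taskDefinition` value returned by the ECS DescribeTaskDefinition API; the helper name `find_volume_usage` and the example data are hypothetical, and the check does not inspect VOLUME instructions baked into the image itself.

```python
def find_volume_usage(task_definition):
    """Return human-readable findings about volume usage in a task definition.

    `task_definition` is assumed to be a dict shaped like the
    "taskDefinition" value returned by ECS DescribeTaskDefinition.
    """
    findings = []
    # Volumes declared at the task level.
    for volume in task_definition.get("volumes", []):
        findings.append(f"task-level volume: {volume.get('name')}")
    # Per-container mounts and volumes shared from other containers.
    for container in task_definition.get("containerDefinitions", []):
        name = container.get("name")
        for mount in container.get("mountPoints", []):
            findings.append(f"{name}: mountPoint {mount.get('containerPath')}")
        for source in container.get("volumesFrom", []):
            findings.append(f"{name}: volumesFrom {source.get('sourceContainer')}")
    return findings


# Hypothetical task definition, for illustration only.
example = {
    "volumes": [],
    "containerDefinitions": [
        {"name": "web", "mountPoints": [], "volumesFrom": []},
    ],
}
print(find_volume_usage(example))  # prints []
```

An empty list only means the task definition itself declares no volumes, mount points, or `volumesFrom` entries.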

Which service(s) is this request for? Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I am trying to get our tasks to run on platform 1.4.0 again, but they keep crashing on startup with the error described above. I have moved the affected service back to platform version 1.3.0, and that has allowed us to start the tasks again.
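The rollback described above can also be scripted. This is a minimal sketch, not necessarily how it was done here: it assumes a boto3 ECS client is passed in, and the function name `pin_platform_version` plus the cluster and service names are placeholders.

```python
def pin_platform_version(ecs, cluster, service, platform_version="1.3.0"):
    """Ask ECS to redeploy `service` on the given Fargate platform version.

    `ecs` is assumed to be a boto3 ECS client (boto3.client("ecs")).
    Returns the platform version reported back in the response.
    """
    response = ecs.update_service(
        cluster=cluster,
        service=service,
        platformVersion=platform_version,
        # Request a new deployment so tasks are replaced on the new version.
        forceNewDeployment=True,
    )
    return response["service"]["platformVersion"]


# Usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# pin_platform_version(boto3.client("ecs"), "my-cluster", "my-service")
```

Passing the client in as an argument keeps the function easy to exercise against a stub before pointing it at a production account.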

I then updated the version to 1.4.0 again, but the tasks just keep failing and trying to start again continuously. One task was able to start on 1.4.0, but none of the others could, so I am currently running a combination of 2 tasks on 1.3.0 and 1 on 1.4.0.
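To confirm a mixed state like this without clicking through the console, a small sketch along these lines could count a service's tasks per platform version. It assumes a boto3 ECS client is passed in; the function name `count_tasks_by_platform` is hypothetical, and pagination beyond the first page of `list_tasks` results is deliberately left out.

```python
from collections import Counter


def count_tasks_by_platform(ecs, cluster, service, desired_status="RUNNING"):
    """Count a service's tasks grouped by Fargate platform version.

    `ecs` is assumed to be a boto3 ECS client. Only the first page of
    list_tasks results is inspected (no pagination in this sketch).
    """
    arns = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus=desired_status,
    )["taskArns"]
    if not arns:
        # describe_tasks needs at least one task ARN.
        return Counter()
    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns)["tasks"]
    return Counter(task.get("platformVersion", "unknown") for task in tasks)


# Usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# print(count_tasks_by_platform(boto3.client("ecs"), "my-cluster", "my-service"))
```

Calling it with `desired_status="STOPPED"` gives the same breakdown for recently stopped tasks, which may help show whether failures are confined to one platform version.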

I have tested the same scenario on our staging AWS account, which is a replica of our production environment, and I cannot replicate the error. I can switch between platform versions as much as I like on our staging account, with new tasks starting successfully each time. This makes me believe that this is either a bug in a recent release or that our service is running on faulty infrastructure.

Are you currently working around this issue? Yes, we are running our production service on platform 1.3.0

Additional context I have attached the task definition and some screenshots of the currently running and stopped tasks.

Attachments

taskdefinition.txt

Screenshot 2021-07-01 at 13 54 47

[ECS/Fargate][bug]: Unable to start new tasks on platform 1.4.0 suddenly #1425