aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Support ephemeral storage limits for EC2 launch type #2442

Open Nevon opened 1 month ago

Nevon commented 1 month ago

Tell us about your request
Allow limiting the ephemeral storage used by a task through the task definition, similar to how you can limit the amount of memory available to a task.

Which service(s) is this request for?
ECS (EC2)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
My team operates large multi-tenant clusters. Because there's no limit on the amount of ephemeral storage each individual task can consume, a single task sometimes ends up consuming a lot of disk space, for example by writing excessive crash dumps or logs. Because we don't have any insight into how much disk space an individual task has used, we have no choice in these cases but to terminate the entire container instance rather than just evicting the offending task. This affects all the other workloads running on the same instance, but it is better than running out of disk space and having all workloads stop working.

If instead we could impose a limit on the amount of ephemeral storage that is used by a task, and stop it if it exceeds the limit, we would not have to impact other "innocent" workloads and we would not need to maintain automation to drain container instances that are starting to run out of free disk space (which by itself is also not all that simple, since EBS doesn't provide metrics for this).
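
To make the requested behavior concrete, here is a minimal sketch of the kind of per-task check such a limit implies. It assumes each task's writable data lives under a single directory, which is a simplification for illustration and not how ECS or Docker actually lays out storage; the function names are hypothetical.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all files below `root` (illustrative helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: a symlink counts as the link itself, not its target
                total += os.lstat(path).st_size
            except OSError:
                # The file disappeared between listing and stat; skip it
                continue
    return total


def exceeds_limit(used_bytes: int, limit_bytes: int) -> bool:
    """True when a task has used more ephemeral storage than its limit."""
    return used_bytes > limit_bytes
```

With something like this enforced per task, only the task whose usage exceeds its limit would need to be stopped, rather than the whole container instance.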

Are you currently working around this issue?
We have to collect disk utilization metrics and publish them to our observability platform, then create monitors that fire when the available disk space drops below a certain threshold. Those monitors trigger a draining process that relocates all the running tasks and then terminates the container instance.
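
A minimal sketch of the threshold check at the core of this kind of workaround, using only the Python standard library. The 15% threshold and the function names are assumptions for illustration, not our actual automation.

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Fraction of the disk that is still free, between 0.0 and 1.0."""
    return free_bytes / total_bytes


def should_drain(path: str = "/", min_free_fraction: float = 0.15) -> bool:
    """True when free space on the filesystem holding `path` is below the threshold.

    The 0.15 default is a hypothetical value chosen for this example.
    """
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) < min_free_fraction
```

When the check fires, the instance can be set to DRAINING, for example through the ECS `UpdateContainerInstancesState` API, so that its tasks are relocated before the instance is terminated.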

Additional context
The equivalent functionality in Kubernetes is ephemeral storage limits and requests. I am personally mostly interested in the limits rather than the requests, although both would of course be useful: requests would help avoid placing a new task on a host that doesn't have enough available ephemeral storage.