aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: add available ENI slots in container instance election for task provisioning #1545

Open pqueixalos opened 2 years ago

pqueixalos commented 2 years ago

Tell us about your request When tasks use the awsvpc network mode, the ECS scheduler does not take the ENI limit of each EC2 container instance into account when placing tasks. As a result, tasks often fail to start and are delayed in coming up, with the stoppedReason: Unable to attach network interface to unused device index.

I guess this could also happen when using ENI trunking.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? EC2 container instances are limited by CPU, memory, and ENI capacity, but the ECS scheduler does not consider the latter when selecting a container instance for a task.

Are you currently working around this issue? We are considering switching to ENI trunking to mitigate the issue.

daveygit2050 commented 3 months ago

A member of the containers team asked me to raise a feature request regarding a support case we have. However, since the solution proposed in this issue would solve our problem too, I thought it better to add context here rather than create a new issue.

For context, we have worked around this issue by allocating a fixed amount of CPU to each task, so that an instance runs out of CPU capacity at the same point it runs out of ENI trunking capacity. For example, an m5.xlarge supports 20 branch ENIs and 4096 CPU units, so we’ve allocated 204 CPU units to each task (4096 − (20 × 204) = 16 units left over, too few for a 21st task) in order to avoid placing more tasks on an instance than it has branch ENIs. However, this comes with downsides:

  1. We cannot declare how much CPU each task actually requires when tasks have differing requirements
  2. We cannot add instance types with different ENI trunking capacity to our ECS cluster via capacity providers
  3. We still see problems with task placement failing due to ENI trunk capacity issues, especially during periods of high deployment activity
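The arithmetic behind this workaround can be sketched as follows. This is only an illustration: the function name `cpu_reservation_range` is hypothetical, and it assumes every task uses awsvpc networking, that all tasks reserve the same amount of CPU, and that CPU is the only other limit in play. It returns the range of per-task reservations for which exactly `max_tasks` tasks fit on an instance by CPU alone.

```python
def cpu_reservation_range(instance_cpu_units: int, max_tasks: int) -> tuple[int, int]:
    """Return (lowest, highest) per-task CPU reservation for which exactly
    max_tasks tasks fit on the instance by CPU alone.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    if instance_cpu_units <= 0 or max_tasks <= 0:
        raise ValueError("instance_cpu_units and max_tasks must be positive")
    # Highest reservation at which max_tasks tasks still fit.
    highest = instance_cpu_units // max_tasks
    # Lowest reservation at which max_tasks + 1 tasks no longer fit.
    lowest = instance_cpu_units // (max_tasks + 1) + 1
    if lowest > highest:
        raise ValueError("no integer reservation caps the instance at exactly max_tasks")
    return lowest, highest


# Figures from the comment above: m5.xlarge, 4096 CPU units, 20 branch ENIs.
lowest, highest = cpu_reservation_range(4096, 20)
unused = 4096 - 20 * highest
print(lowest, highest, unused)  # 196 204 16
```

Under these assumptions, any reservation from 196 to 204 CPU units caps the instance at 20 tasks. The 204 used above is the top of that range and leaves 16 units unused.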

With regard to point 3, we have identified a problem when two or more ECS services deploy concurrently in the same ECS cluster (the likelihood increases with each additional concurrent deployment). Tasks are placed on ECS instances that have no available ENI trunk capacity, and they fail with:

Unexpected EC2 error while attempting to associate branch interface to trunk interface: AssociationLimitExceeded

This appears to be the same problem described in this issue: the ECS scheduler does not take ENI capacity into account when simulating where to place the task.
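To make the request concrete, here is a minimal sketch of a placement filter that treats free ENI slots as a resource alongside CPU and memory. It does not describe how the ECS scheduler is implemented. The `Instance` type, its fields, and `eligible_instances` are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of a container instance's remaining capacity."""
    name: str
    cpu_free: int
    memory_free: int
    eni_slots_free: int


def eligible_instances(instances, task_cpu, task_memory, check_eni=True):
    """Return names of instances that can host one awsvpc task.

    With check_eni=False only CPU and memory are considered, which mirrors
    the behavior reported in this issue.
    """
    result = []
    for inst in instances:
        if inst.cpu_free < task_cpu or inst.memory_free < task_memory:
            continue
        # Each awsvpc task needs one ENI slot (a branch ENI with trunking).
        if check_eni and inst.eni_slots_free < 1:
            continue
        result.append(inst.name)
    return result


cluster = [
    Instance("a", cpu_free=2048, memory_free=8192, eni_slots_free=0),
    Instance("b", cpu_free=1024, memory_free=4096, eni_slots_free=3),
]
print(eligible_instances(cluster, 256, 512, check_eni=False))  # ['a', 'b']
print(eligible_instances(cluster, 256, 512))                   # ['b']
```

In this sketch, instance "a" has spare CPU and memory but no free ENI slot. It is only excluded when ENI capacity is part of the check, which is the behavior this issue asks for.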

Our hypothesis is that this issue is fairly common in large-scale ECS clusters that can see multiple concurrent deployments, but due to the ECS eventual consistency model it is quite difficult to observe unless:

  1. The circuit breaker is enabled for the ECS services, so multiple task placement failures lead to an easily observable deployment failure
  2. You subscribe to ECS task state change events and log/observe the stoppedReason attribute for tasks as they stop
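The second option could look roughly like the sketch below. It assumes an EventBridge rule delivers ECS Task State Change events to a Python function such as a Lambda handler. The names `TASK_STOPPED_PATTERN`, `ENI_FAILURE_MARKERS`, `summarize_stopped_task`, and `handler` are hypothetical, and the marker strings are taken from the two error messages quoted in this issue rather than from an exhaustive list.

```python
import json
import logging

logger = logging.getLogger("ecs-stopped-tasks")

# Event pattern for an EventBridge rule that matches stopped ECS tasks.
TASK_STOPPED_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}

# Substrings of the stoppedReason values quoted in this issue (assumed markers).
ENI_FAILURE_MARKERS = (
    "Unable to attach network interface",
    "AssociationLimitExceeded",
)


def summarize_stopped_task(event):
    """Return a small summary dict for a stopped task, or None otherwise."""
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return None
    reason = detail.get("stoppedReason", "")
    return {
        "taskArn": detail.get("taskArn"),
        "group": detail.get("group"),
        "containerInstanceArn": detail.get("containerInstanceArn"),
        "stoppedReason": reason,
        "eniRelated": any(marker in reason for marker in ENI_FAILURE_MARKERS),
    }


def handler(event, context=None):
    """Log one JSON line per stopped task so ENI failures can be searched."""
    summary = summarize_stopped_task(event)
    if summary is not None:
        logger.warning(json.dumps(summary))
    return summary
```

Logging one JSON line per stopped task makes it possible to filter on `eniRelated` and count how often placements fail for ENI reasons during a deployment.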

We feel that this leads to a poor user experience: at best, deployments take longer to complete as they become eventually consistent; at worst, the circuit breaker triggers for task placement failures that are out of the customer’s control.