Open pqueixalos opened 3 years ago
I was asked to raise a feature request by a member of the containers team regarding a support case we have. However, as the solution proposed by this issue would solve our problem too, I thought it would be better to add context to this issue rather than create a new one.
For context, we have worked around this issue by allocating a fixed amount of CPU to our tasks to match the ENI trunking capacity of the ECS instances. For example, an m5.xlarge supports 20 branch ENIs and 4096 CPU units, so we've allocated 204 CPU units to each task (4096 − (20 × 204) = 16 units left over) in order to avoid allocating too many tasks to the same instance. However, this comes with downsides:
With regard to point 3, we have identified a problem that occurs when two or more ECS services perform deployments concurrently in the same ECS cluster (the likelihood increases with each additional concurrent deployment). Tasks fail because they are placed on ECS instances that have no available ENI trunk capacity:
Unexpected EC2 error while attempting to associate branch interface to trunk interface: AssociationLimitExceeded
This appears to be due to the same problem described in this issue: the ECS scheduler does not take ENI capacity into account when performing its simulation of where to place the task.
Our hypothesis is that this issue is fairly common in large-scale ECS clusters that can see multiple concurrent deployments, but due to the ECS eventual consistency model it is quite difficult to observe unless:
stoppedReason attribute for tasks as they stop
We feel that this leads to a poor user experience, as at best it means deployments take longer to complete as they become eventually consistent, and at worst the circuit breaker will trigger for task placement failures that are out of the customer’s control.
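To make the CPU-sizing workaround described earlier in this comment concrete, here is a minimal sketch of the arithmetic. The function name and its validation check are illustrative assumptions, not part of any AWS API; the figures are the m5.xlarge numbers quoted above.

```python
from __future__ import annotations


def cpu_units_per_task(instance_cpu_units: int, branch_eni_limit: int) -> tuple[int, int]:
    """Hypothetical helper: size each task's CPU reservation so that CPU,
    rather than ENI capacity, caps the number of tasks per instance.

    Returns (cpu units to reserve per task, cpu units left unreserved).
    """
    if branch_eni_limit <= 0:
        raise ValueError("branch_eni_limit must be positive")

    # Largest reservation that still lets `branch_eni_limit` tasks fit.
    per_task = instance_cpu_units // branch_eni_limit
    leftover = instance_cpu_units - per_task * branch_eni_limit

    # The workaround only holds if one more task would NOT fit by CPU alone.
    if per_task == 0 or per_task * (branch_eni_limit + 1) <= instance_cpu_units:
        raise ValueError("CPU reservation cannot cap tasks at the ENI limit")

    return per_task, leftover


# m5.xlarge figures from the comment: 4096 CPU units, 20 branch ENIs.
per_task, leftover = cpu_units_per_task(4096, 20)
print(per_task, leftover)  # 204 16
```

The 16 leftover units are smaller than one task's reservation, so a 21st task cannot be placed on CPU grounds, which is what keeps placement within the branch ENI limit.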
Tell us about your request When using the awsvpc networking type, the ECS scheduler does not take into account the task limit of the EC2 nodes in terms of ENI allocation when scheduling tasks. As a result, tasks often fail to start and are delayed in coming up, with the stoppedReason being: Unable to attach network interface to unused device index.
I guess this could also happen when using ENI trunking.
Which service(s) is this request for? ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? EC2 container instances are limited by CPU, memory, and ENI capacity, but the last of these is not considered by the ECS scheduler when it selects a container instance.
Are you currently working around this issue? Considering switching to ENI trunking to mitigate the issue.
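As a rough illustration of the behaviour being requested, the sketch below treats free ENI slots as a third placement resource alongside CPU and memory. It is not how the ECS scheduler is implemented; the `Instance` and `Task` classes, their fields, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    # Hypothetical view of a container instance's remaining resources.
    name: str
    free_cpu: int
    free_memory: int
    free_enis: int  # ENI attachment slots (or branch ENIs) still available


@dataclass
class Task:
    cpu: int
    memory: int
    needs_eni: bool = True  # awsvpc tasks each need their own ENI


def can_place(task: Task, inst: Instance) -> bool:
    """Return True only if CPU, memory, AND ENI capacity are all sufficient."""
    fits = inst.free_cpu >= task.cpu and inst.free_memory >= task.memory
    if task.needs_eni:
        fits = fits and inst.free_enis >= 1
    return fits


def eligible_instances(task: Task, instances: List[Instance]) -> List[str]:
    """Names of instances that could host the task under the ENI-aware check."""
    return [inst.name for inst in instances if can_place(task, inst)]


instances = [
    Instance("a", free_cpu=2048, free_memory=4096, free_enis=0),  # ENIs exhausted
    Instance("b", free_cpu=1024, free_memory=2048, free_enis=3),
]
print(eligible_instances(Task(cpu=256, memory=512), instances))  # ['b']
```

Instance "a" has plenty of CPU and memory but no free ENI slot, so a check based on CPU and memory alone would still consider it, which corresponds to the failure mode described in this issue.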