Open brentryan opened 5 years ago
We are facing this issue as well. When scaling up, a bunch of tasks get scheduled before the daemons and fill up the instance, so there isn't enough CPU or memory left for the daemon.
We use a daemon for log parsing / forwarding, so this is quite a big issue for us.
We are currently working on daemon scheduler enhancements that will resolve the issue as defined above. All customers will get the enhancements out of the box:
Please feel free to provide feedback on the Github Issue here. Hope this helps!
the above sounds great @pavneeta !
@pavneeta Is there any way to add a feature like this:
What do you think about it?
@kapralVV I suspect this starts to dive into the territory of https://github.com/aws/containers-roadmap/issues/105 ?
+1 to think about what happens when adding a new daemon task to an existing cluster - that seems to be the main case not handled in @pavneeta's update above. This sounds great though!
Edit: If moving tasks off of the instance to make room is difficult, I wonder whether ECS could mark an instance as unhealthy when there aren't enough resources to run the daemon, drain it, launch a new instance with the new resource reservations, and reschedule the tasks there.
We are currently working on daemon scheduler enhancements that will resolve the issue as defined above. All customers will get the enhancements out of the box:
1. ECS will ensure that daemon tasks are the first tasks placed on new ECS container instances, so that monitoring and security agents are launched before the application containers are launched on the container instance.
2. ECS will also reserve the CPU, memory, and ENI resources defined for the daemon task on the instance. This ensures that, in case of a daemon launch failure or during daemon service updates, another task launch does not ‘steal’ the resources for the daemon task and prevent it from running successfully.
Please feel free to provide feedback on the GitHub issue here. Hope this helps!
Is there any update on this feature set? Is it still in the pipeline? I have very little constructive feedback to add beyond "LGTM" for the two points provided; these would completely alleviate existing issues with daemon services not being present on hosts during aggressive scaling.
There are other suggestions here to take this further into the realms of rebalancing which I would certainly support and appreciate, but just addressing these first two points in a "dumb" manner at the time of instance initialization (i.e. without worrying about handling the "adding a new daemon" case above) is an extremely useful step that delivers immediate value short of any of the rebalancing/#105 realm of feature requests, which are presumably harder to deliver.
+1
There is another bug (which is hopefully more limited): changing the resource allocation for a daemon service.
If a daemon service is updated to need more memory/CPU, there is a failure state if the container instance does not have the required allocation left.
As tested, we can monitor and see the old version running, but when the deployment reaches that instance it stops the old version of the daemon and then fails to start the new version.
Possible Solutions:
Curious on further thoughts here.
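Not a fix, but the failure state above can at least be detected before a deployment. A minimal sketch, assuming each instance dict has the shape returned by the ECS `DescribeContainerInstances` API (a `remainingResources` list with `CPU` and `MEMORY` entries) and assuming the old daemon's reservation is released when it stops. The function names and the idea of comparing old and new reservations are illustrative, not something ECS does for you:

```python
def fits_after_update(container_instance, old_cpu, old_mem, new_cpu, new_mem):
    """Return True if the updated daemon task should fit on this instance.

    Assumption: stopping the old daemon task frees old_cpu / old_mem,
    so the headroom is the remaining resources plus the old reservation.
    """
    remaining = {
        r["name"]: r.get("integerValue", 0)
        for r in container_instance.get("remainingResources", [])
    }
    cpu_ok = remaining.get("CPU", 0) + old_cpu >= new_cpu
    mem_ok = remaining.get("MEMORY", 0) + old_mem >= new_mem
    return cpu_ok and mem_ok


def instances_at_risk(container_instances, old_cpu, old_mem, new_cpu, new_mem):
    """List the ARNs of instances where the updated daemon would not fit."""
    return [
        ci["containerInstanceArn"]
        for ci in container_instances
        if not fits_after_update(ci, old_cpu, old_mem, new_cpu, new_mem)
    ]
```

Any instance this returns is one where the deployment would stop the old daemon and then fail to place the new one, so it would need to be drained or have replica tasks moved off first.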
We're having exactly the issue @billalley is describing, where it's impossible to safely change the resource reservations for a daemon service. We ended up having to work around it by having a lambda subscribed to ECS service task placement failure events, filter out everything except daemon services, and then draining any container instance where a daemon task failed to be placed. It would be far preferable if the scheduler moved some replica tasks off the container instance to make room for the daemon task.
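For anyone who wants to try the same workaround, here is a rough sketch of what such a Lambda could look like. It is not the actual function from the comment above. It assumes the function is triggered by an EventBridge rule for "ECS Service Action" events, that the failure event carries the service ARN in `resources` and the affected instances in `detail.containerInstanceArns`, and that draining at most 10 instances per API call is acceptable. `handle_event` and `is_placement_failure` are made-up names; `describe_services` and `update_container_instances_state` are the boto3 ECS client calls.

```python
def is_placement_failure(event):
    """True for ECS service task placement failure events (assumed shape)."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "ECS Service Action"
        and detail.get("eventName") == "SERVICE_TASK_PLACEMENT_FAILURE"
    )


def handle_event(event, ecs):
    """Drain instances where a daemon task failed to be placed.

    `ecs` is a boto3 ECS client (or any object with the same two methods).
    Returns the list of container instance ARNs that were set to DRAINING.
    """
    if not is_placement_failure(event):
        return []
    detail = event["detail"]
    cluster = detail["clusterArn"]
    instances = detail.get("containerInstanceArns", [])
    services = event.get("resources", [])
    if not instances or not services:
        return []

    # Filter out everything except daemon services.
    described = ecs.describe_services(cluster=cluster, services=services[:10])
    is_daemon = any(
        s.get("schedulingStrategy") == "DAEMON"
        for s in described.get("services", [])
    )
    if not is_daemon:
        return []

    # The API accepts a limited number of instances per call, so batch by 10.
    for i in range(0, len(instances), 10):
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=instances[i:i + 10],
            status="DRAINING",
        )
    return instances


def lambda_handler(event, context):
    import boto3  # imported here so the logic above can be tested without AWS
    return handle_event(event, boto3.client("ecs"))
```

Draining is a blunt instrument: it evicts every replica task on the instance just to make room for one daemon task, which is why scheduler-side support would be far preferable.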
This is still really painful for us. Please help.
Any update on this?
This is driving me insane.
Last update was 'We are currently working on a daemon scheduler enhancements that will resolve the issue', over 3 years ago. Any news?
The docs literally say, 'When you launch a daemon service on a cluster with other replica services, Amazon ECS prioritizes the daemon task. This means that the daemon task is the first task to launch on the instances and the last task to stop. This strategy ensures that resources aren't used by pending replica tasks and are available for the daemon tasks.', which is clearly incorrect.
Tell us about your request
What do you want us to build?
Currently when you use DAEMON tasks you can get into situations where the task cannot be scheduled because there isn't enough CPU/Memory available on the instance. However, this is critical when you want to run something with 1 DAEMON task per host for things like log aggregation, datadog agent, etc.
I think we need something like DAEMON_MEMORY_RESERVATION_MB/DAEMON_CPU_RESERVATION that we can populate to reserve this space so that ECS can still schedule these tasks.
Which service(s) is this request for?
This could be Fargate, ECS, EKS, ECR
ECS
Are you currently working around this issue?
How are you currently solving this problem?
The only workaround I'm aware of is to ensure your instances have plenty of memory/CPU, which means over-provisioning your cluster and costs us more.