[ECS] [request]: Prioritize Daemon task scheduling above Replica tasks

ericdahl commented 5 years ago

Tell us about your request

When our Replica ECS Services scale up and we launch new ECS Container Instances (from an ASG, scaling on *Reservation metrics), sometimes the Replica tasks are launched on the instance fast enough that our Daemon Services do not have a chance to launch on these new hosts. If these replicas use enough CPU/memory, there may not be room for the Daemon services to run.

For example, we have a few daemon services to collect host-level metrics and forward log files. We want these to run on every host. Periodically we see that hosts have been saturated with Replica tasks and there's no room for the Daemon tasks. This means we lack monitoring and visibility into these hosts.

Which service(s) is this request for?

ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Every host should have Daemon tasks provisioned on it, reliably.

Are you currently working around this issue?

We manually go into the ECS console, review all of the (possibly hundreds of) hosts to identify which ones are missing Daemon services, then run "Stop Task" on one of the replica tasks on each such host so that the Daemon tasks have room to launch.
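The manual console review described above could, in principle, be scripted. The following is only a minimal sketch: it assumes task records shaped like the `tasks` entries returned by the ECS DescribeTasks API (`containerInstanceArn`, `group`, `lastStatus`), and it uses a hypothetical daemon service name, `log-forwarder`, and made-up ARNs. Fetching the data from AWS is left out.

```python
from typing import Iterable


def instances_missing_daemon(
    instance_arns: Iterable[str],
    tasks: Iterable[dict],
    daemon_service: str,
) -> list[str]:
    """Return the container instance ARNs with no RUNNING task of the daemon service."""
    # Tasks started by a service carry the group "service:<service name>".
    daemon_group = f"service:{daemon_service}"
    covered = {
        task["containerInstanceArn"]
        for task in tasks
        if task.get("group") == daemon_group and task.get("lastStatus") == "RUNNING"
    }
    return sorted(arn for arn in instance_arns if arn not in covered)


# Hypothetical data: two hosts, only one of which runs the daemon.
instances = ["arn:host-a", "arn:host-b"]
tasks = [
    {"containerInstanceArn": "arn:host-a", "group": "service:log-forwarder", "lastStatus": "RUNNING"},
    {"containerInstanceArn": "arn:host-a", "group": "service:web", "lastStatus": "RUNNING"},
    {"containerInstanceArn": "arn:host-b", "group": "service:web", "lastStatus": "RUNNING"},
]
print(instances_missing_daemon(instances, tasks, "log-forwarder"))
```

The hosts it reports are the ones where a replica task would have to be stopped to free room for the Daemon task.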

CpuID commented 5 years ago

Just got bitten by this again today (it has happened more than once). Had to go find some tasks to kill off to make room. Then the issue is that the Daemon service wouldn't "retry" quickly enough to fill the void on that host, and other tasks would get binpacked in there...

pavneeta commented 5 years ago

@ericdahl @CpuID Thank you so much for your valuable feedback. We are currently in the middle of scoping out the solution to this known problem.

CpuID commented 5 years ago

Just ran into this again on a deploy of a daemon service (a log ingest process - filebeat), had to stop a bunch of other tasks to make room...

efenderbosch commented 4 years ago

Is there a projected timeline for this? Or is this:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html

a viable alternative?

edify42 commented 4 years ago

The ECS agent runs as a Docker container and makes it onto every EC2 instance. A workaround, therefore, would be to start your platform-critical daemon processes without an ECS task definition and instead do it in your user-data.txt.

You'd need a wrapper to ensure the container is always running and restarted if a rudimentary health check fails among other things (like a way to insert secrets to the docker containers).

I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

CpuID commented 4 years ago

> The ECS agent runs as a Docker container and makes it onto every EC2 instance. A workaround, therefore, would be to start your platform-critical daemon processes without an ECS task definition and instead do it in your user-data.txt.

Yeah, it's possible to start things with priority outside of the ECS ecosystem, but you then need to reserve resources in the ECS agent, and deploys of new versions of daemon services are still a PITA (you need to replace the EC2 instances instead of just doing a daemon service deploy). A step backwards overall ;) (probably what we all did before daemon services were a thing)

> I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

+100 - the ideal goal here

mwarkentin commented 4 years ago

@pavneeta any updates on where this is at?

davinod commented 4 years ago

+1 for this one

ctcherry commented 4 years ago

+1 We had to wrestle with this on a dense ECS cluster today, would be great to see this baked in!

akrymets commented 3 years ago

Hi, guys! Any updates on this topic? It is still relevant. Thanks!

toricls commented 3 years ago

Hi @akrymets, thanks for the comment!

Here is the latest update announcement we published in May regarding ECS daemon scheduling improvements!

https://aws.amazon.com/blogs/containers/improving-daemon-services-in-amazon-ecs/

alexpcoleman commented 2 years ago

Any updates on this? It still occurs quite regularly.

connorcartwright commented 2 years ago

Hey @toricls

I was wondering if there were any updates on this. Whilst the improvements you mentioned above are great to hear, we are looking for complete reliability in the placement of Daemon tasks.

To rely on Daemon tasks to, for example, deploy logging agents alongside application tasks across hundreds, potentially thousands, of instances, we would need a very high level of confidence that the Daemon tasks would always be placed.

Thanks!

toricls commented 2 years ago

Hey @connorcartwright, thanks for your interest and feedback on this!

We know how important this is for our customers who want to use Daemon tasks to achieve reliable operations, and we've been having ongoing conversations on this topic in the team, including last week.

Unfortunately there's nothing concrete I can share about progress at this moment, but we will update you as soon as we have meaningful progress on this!

ianvernon commented 2 years ago

I'm curious as to whether the following could be a workaround for this:

1. Add an attribute to all ECS instances when they are started.
2. Configure all services to use placement constraints that prevent them from running on instances with said attribute.
3. Allow DAEMON tasks to run on the instances with said attribute. The DAEMON task would then remove the attribute from the instance when it is ready and initialized, thereby untainting it, so that services can run.

The con here is that this requires your task to access the AWS API, and I'm not sure how it would affect scaling / capacity providers if there is a scale-out event.
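The pieces of this idea can be sketched as plain request payloads. This is an untested sketch, not a verified recipe: the attribute name `daemons-pending` is hypothetical, the `!exists` operator is assumed to behave as described in the ECS cluster query language documentation, and how the attribute gets set at instance launch is not shown. The second dictionary holds the keyword arguments a daemon task might pass to boto3's `ecs.delete_attributes()`.

```python
# Hypothetical attribute name; any custom attribute name would do.
TAINT_ATTRIBUTE = "daemons-pending"


def replica_placement_constraint(attribute: str = TAINT_ATTRIBUTE) -> dict:
    """Constraint for replica services: avoid instances that still carry the taint.

    ASSUMPTION: the cluster query language accepts `!exists` for custom attributes.
    """
    return {"type": "memberOf", "expression": f"attribute:{attribute} !exists"}


def untaint_request(
    cluster: str,
    container_instance_arn: str,
    attribute: str = TAINT_ATTRIBUTE,
) -> dict:
    """Keyword arguments a daemon task could pass to boto3's ecs.delete_attributes()."""
    return {
        "cluster": cluster,
        "attributes": [
            {
                "name": attribute,
                "targetType": "container-instance",
                "targetId": container_instance_arn,
            }
        ],
    }


# Hypothetical cluster name and instance ARN, for illustration only.
print(replica_placement_constraint())
print(untaint_request("my-cluster", "arn:host-a"))
```

As noted above, the daemon task would need IAM permission to call the ECS API for the second step.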

sdpoueme commented 2 years ago

Hello team, do we have a release timeline for this feature request?

sopeters commented 2 years ago

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler_daemon

states:

Amazon ECS reserves container instance compute resources including CPU, memory, and network interfaces for the daemon tasks. When you launch a daemon service on a cluster with other replica services, Amazon ECS prioritizes the daemon task. This means that the daemon task is the first task to launch on the instances and the last task to stop. This strategy ensures that resources aren't used by pending replica tasks and are available for the daemon tasks.

Have there been any recent changes that address this issue?
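As a toy illustration of the reservation behavior the quoted documentation describes (this is not ECS's actual scheduler, and all names and numbers are hypothetical), the sketch below models a single instance on which replicas arrive before the daemon, with and without memory held back for the daemon.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    memory_mib: int          # total registered memory
    reserved_mib: int = 0    # memory held back for daemon tasks
    used_mib: int = 0        # memory taken by placed tasks

    def place_replica(self, task_mib: int) -> bool:
        """Place a replica only in memory that is neither used nor reserved."""
        free = self.memory_mib - self.used_mib - self.reserved_mib
        if task_mib > free:
            return False
        self.used_mib += task_mib
        return True

    def place_daemon(self, task_mib: int) -> bool:
        """Place a daemon, which may draw on the reserved memory."""
        free = self.memory_mib - self.used_mib
        if task_mib > free:
            return False
        self.used_mib += task_mib
        self.reserved_mib = max(0, self.reserved_mib - task_mib)
        return True


def daemon_fits_after_replicas(reserve: bool) -> bool:
    """Replicas arrive first; report whether the daemon still fits afterwards."""
    daemon_mib, replica_mib = 256, 512
    host = Instance(memory_mib=2048, reserved_mib=daemon_mib if reserve else 0)
    while host.place_replica(replica_mib):
        pass  # bin-pack replicas until none fits
    return host.place_daemon(daemon_mib)


print(daemon_fits_after_replicas(reserve=False))  # False: replicas filled the host
print(daemon_fits_after_replicas(reserve=True))   # True: the reservation kept room
```

The first case corresponds to the behavior reported in this issue, the second to what the documentation states.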

turacma commented 1 year ago

Just encountered this issue myself, so clearly still an issue. Is there an update regarding this issue? The documentation does not match the behavior, so this no longer seems like a request and should now be considered a bug.

immanetize commented 1 year ago

As someone who regularly encounters this issue, I am looking forward to moving workloads to EKS, where I can use priority classes. The scheduling calculation needed to ensure daemonset tasks are scheduled first, as in the literal request, does seem arduous, but I'm OK with evicting lower-priority replica tasks whenever the scheduler gets around to placing daemonset tasks.
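The eviction approach mentioned here can be illustrated with a toy model. It is loosely inspired by priority-based preemption and does not reproduce the real behavior of either the ECS or the Kubernetes scheduler; the task names, priorities, and sizes are hypothetical.

```python
def evict_for_daemon(
    free_mib: int,
    daemon_mib: int,
    replicas: list[tuple[str, int, int]],
) -> list[str]:
    """Pick replica tasks to evict so that a daemon needing daemon_mib fits.

    replicas holds (task_id, priority, memory_mib); lower priority is evicted first.
    Returns the task ids to stop, or raises if the daemon cannot fit at all.
    """
    evicted = []
    for task_id, _priority, memory_mib in sorted(replicas, key=lambda r: r[1]):
        if free_mib >= daemon_mib:
            break  # enough room already, stop evicting
        evicted.append(task_id)
        free_mib += memory_mib
    if free_mib < daemon_mib:
        raise ValueError("daemon does not fit even with all replicas evicted")
    return evicted


# A saturated host: no free memory, and the daemon needs 256 MiB.
running = [("web-1", 10, 512), ("batch-1", 1, 128), ("batch-2", 2, 256)]
print(evict_for_daemon(free_mib=0, daemon_mib=256, replicas=running))
```

With this data the two lowest-priority tasks, `batch-1` and `batch-2`, are selected, and `web-1` keeps running.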

mcfadden commented 1 year ago

We still encounter this regularly, especially during large scale-up events.

pwrmiller commented 11 months ago

We encounter this during scale up events (especially when there are also daemon service updates to deploy in the same change set). Would appreciate the ECS team responding if possible.

balexx commented 11 months ago

This has been open for over 4 years. Really, how is this still an issue? Is ECS a deprecated product?

mlanett commented 7 months ago

This is not only a memory issue. Some containers NEED the daemons to be present and running, for instance logging daemons. The daemons must be brought up before regular tasks.

emorneau commented 7 months ago

I am facing the same issue. We got around it using the workaround below, but would love for AWS to resolve this to reduce the ECS configuration (i.e. complexity) needed.

Workaround: for every application ECS service, we added an ECS service placement constraint of "memberOf (task:group == service:daemonServiceName)" so that the Daemon service must start a task on the ECS instance before any application tasks are assigned to it.
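For reference, the constraint above might look like this when built programmatically. This is a minimal sketch: `log-forwarder` is a hypothetical daemon service name, and the resulting dictionary is meant to go into the `placementConstraints` list of the application service definition (for example, via boto3's `create_service`).

```python
def require_daemon_constraint(daemon_service_name: str) -> dict:
    """Placement constraint limiting a service to instances already running the daemon."""
    return {
        "type": "memberOf",
        "expression": f"task:group == service:{daemon_service_name}",
    }


# Hypothetical daemon service name.
constraint = require_daemon_constraint("log-forwarder")
print(constraint)
```

One consequence to keep in mind: with this constraint in place, an instance on which the daemon never starts would presumably receive no application tasks at all.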

aws / containers-roadmap

[ECS] [request]: Prioritize Daemon task scheduling above Replica tasks #428