aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] One stopping task prevents all other tasks on the instance from moving from PENDING to RUNNING #325

Closed Alonreznik closed 1 year ago

Alonreznik commented 6 years ago

Hello there. We've recently been running into strange behavior in ECS where stopping tasks prevent new tasks from running on an instance.

A little about our case: we have tasks that need to complete their current work and then exit by themselves after a stopTask command. In other words, we have a graceful-shutdown process that can take a while to complete (more than a few seconds, sometimes several minutes).

However, when stopTask is sent to these tasks, they no longer appear in the task list in the ECS console (which is fine), but they also block all other tasks on the same instance that are trying to change their state from PENDING to RUNNING.

Here is an example of one instance's tasks when it happens: [screenshot of the instance's task list]

Why does that behavior happen? Why should one task prevent others from running next to it until it is done? This is bad resource management (we don't use the full capacity of our instances while tasks are pending).

Ideally, the stopping task would keep appearing in the console until it has actually stopped on the instance, and the transition from PENDING to RUNNING would not be affected by other tasks on the same instance.

I hope you can fix that behavior,

Thanks!

petderek commented 6 years ago

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources. For example, a webserver would still need its resource pool while waiting for all connections to terminate. The alternative would be to allocate fewer resources when a task transitions to stopping, but that's not a safe assumption to make across all workloads.

Would configurable behavior help your use case? Or is it sufficient to be more clear about this behavior in the console?

Alonreznik commented 6 years ago

Hey @petderek. Thank you for your response.

I understand it works that way by design. However, I wonder why one task should prevent all the others from running.

The best configurable behavior for us would be per-task handling: keep accounting for the resources held by the stopping task (which is fine and reasonable), but don't prevent other tasks from running on the instance while it still has resources to give.

In our use case, the tasks run a long-polling workload as a service, not a web client. This behavior means our instances don't fill up in time, and it can also get our process stuck during a new deployment, because instances wait for one task to end before the other tasks are allowed to run.

So the instance is effectively in a kind of "disabled" or "draining" state until the long workload is done (which can take some time).

What can we do to get our use case supported in ECS?

Thanks

Alonreznik commented 6 years ago

Hi! Is there any update on this? Is there any solution or workaround that would give us the per-task mechanism?

Thank you in advance!

FlorianWendel commented 5 years ago

Hi everyone,

We are facing the exact same issue. @Alonreznik is right: one task is blocking all other tasks, and in my opinion this does not make sense. Let me illustrate:

Assume we have one task with 10 GB memory reservation running on a container instance that has registered with 30 GB. The container instance shows 20 GB of RAM available and that is correct. Now this task is stopped (the ECS agent will make docker send a SIGTERM) but the container keeps running to finish its calculations (now it shows under stopped tasks as "desired status = STOPPED" and "last status = RUNNING"). The container instance will now show 30 GB available in the AWS ECS console which is nonsense, it should still be 20 GB since the container is still using resources as @petderek mentioned. Even worse, if we try to launch three new tasks with 10 GB memory reservation each, they will all be pending until the still running task transitions to "last status = STOPPED". Expected behavior would be that two out of the three tasks can launch immediately.

I hope my example was understandable, else feel free to ask. And thanks for looking into this :)

yumex93 commented 5 years ago

Hey! As a workaround, you can set ECS_CONTAINER_STOP_TIMEOUT to a smaller value. This configures the time to wait for the container to exit normally before it is forcibly killed. By default, it is set to 30s. More information can be found here. I've marked this issue as a feature request and we will work on it soon.
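
For reference, this is an agent-level setting configured on the container instance itself, typically in /etc/ecs/ecs.config (the value below is just an example):

    ECS_CONTAINER_STOP_TIMEOUT=10s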

Alonreznik commented 5 years ago

Hi @yumex93, thank you for your response. We will be very happy to have that feature as soon as it's out :)

About your workaround: in most cases, we need our containers to shut down gracefully before they die. Decreasing ECS_CONTAINER_STOP_TIMEOUT would cause our workers to be killed before the shutdown completes, so the feature is very much needed :)
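
To give a sense of what our workers do, here is an illustrative sketch only (placeholder functions, not our actual code): the worker traps SIGTERM, finishes the job it is currently processing, and only then exits.

    import signal
    import sys
    import time

    stop_requested = False

    def handle_sigterm(signum, frame):
        # ECS (via Docker) sends SIGTERM when the task is asked to stop.
        # We only take note and let the in-flight job finish instead of dying mid-work.
        global stop_requested
        stop_requested = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    def fetch_job():
        # Placeholder for the real long-poll (e.g. reading from a queue).
        time.sleep(1)
        return None

    def process_job(job):
        # Placeholder for the real work, which can take several minutes.
        pass

    while not stop_requested:
        job = fetch_job()
        if job is not None:
            process_job(job)

    sys.exit(0)  # exit cleanly, before the stop timeout forcibly kills the container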

Thank you again for your help, we're waiting for updates about it.

Alon

FlorianWendel commented 5 years ago

@Alonreznik , @yumex93 We have the same situation, some workers even take a few hours to complete their task and we've leveraged the ECS_CONTAINER_STOP_TIMEOUT to shut those down gracefully as well. Since ECS differentiates between a "desired status" and a "last status" for tasks, I believe it should be possible to handle tasks in the process of shutting down a bit better than how it works today. For illustration of what I mean, see this screenshot:

[screenshot: ecs-bug]

The tasks are still running and still consume resources, but the container instance does not seem to keep track of those resources. If this is more than just confusing display, I expect it to cause issues, e.g. like the one above.

Alonreznik commented 5 years ago

Hi @yumex93, Any update with that issue?

Thanks

Alon

yhlee-aws commented 5 years ago

We are aware of this issue and are working on prioritizing it. We will keep it open for tracking and will provide an update when we have more solid plans.

Alonreznik commented 5 years ago

Hi @yunhee-l. Thank you for your last response. We're still facing this issue, which forces us to launch more servers than we need for our deployments and leaves our workloads stuck. Any update on this?

Thanks

Alonreznik commented 5 years ago

Hi @yunhee-l @FlorianWendel any update?

yhlee-aws commented 5 years ago

We don't have any new updates at this point. We will update when we have more solid plans.

yhlee-aws commented 5 years ago

Related: https://github.com/aws/amazon-ecs-agent/issues/731

tomotway commented 5 years ago

Hi,

Just wanted to add our experience with this with the hopes that it can be bumped in priority.

We need to run tasks that can be long running. With this behaviour as it stands, it essentially locks up the EC2 instance so that it cannot take any more tasks until the first task has shut down (which could be a few hours). It wouldn't be quite so bad if ECS marked the host as unusable and placed tasks on other hosts, but it doesn't; it still sends them to the host that cannot start them. This has the potential to cause a service outage for us, in that we cannot create tasks to handle the workload (we tell the service to create tasks, but it can't due to the lock-up).

Thanks.

Alonreznik commented 5 years ago

@petderek @yumex93 This is something that makes us pay for more resources than we need on every deployment. As you can see, more than one user is suffering from this design.

Do you have any ETA for implementing it or deploying it? This is a real blocker for our ongoing processes.

Thank you

Alon

adnxn commented 5 years ago

@Alonreznik: thanks for following up again and communicating the importance of getting this resolved. this helps us prioritize our tasks.

we don't have an ETA right now - but have identified the exact issue and have a path forward that requires changes to our scheduling system and the ecs agent. so to give you some more context. as @petderek said earlier,

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources.

so changing this behavior will be a departure from our existing way of accounting resources when we schedule tasks. considering that the current way has been in place since the beginning of ECS, the risks involved with changing this are significant as there could be subtle rippling effects in the system. we plan to explore ways to validate this change and ensure to not introduce regressions.

the original design made the trade-off towards oversubscribing resources for placement by releasing resources on the instance when tasks were stopped - but the side effect of that is the behavior you are describing. additionally, now that we've added granular sigkill timeouts for containers with #1849, we can see this problem being exacerbated.

so all that is to say - we're working on this issue and we will update this thread as we work towards deploying the changes.

Alonreznik commented 5 years ago

@adnxn Thank you for your detailed explanation. It helps a lot in understanding the context of the situation.

We of course understand this is built into the design, and we accept it.

However, we are not asking for a radical change in the core system (which is great!!). Our request concerns the ecs-agent's assumption that all of the resources held by the previous tasks must be released on the instance first; we're just asking for this to be handled per task (and also to have some indication that a task is still running on the instance after it got the SIGTERM).

As it looks today, resource accounting and release are handled for the entire instance, not for the individual tasks running on it. If a task releases its resources, the ecs-agent should allow those resources to be scheduled for new tasks (provided they meet the resource requirements).

Thank you for your help! Much appreciated!

Please keep us posted,

Alon

Halama commented 5 years ago

Hello, we are affected by exactly the same issue. We have an ECS service deploying long-polling workers with stopTimeout set to 2 hours. A task in RUNNING state with desired status STOPPED blocks all new tasks scheduled on the same instance, even when there are free resources available.

Adding new instances to the cluster helped us work around this situation, but it can be really costly if there are multiple deploys each day.

Are there any new updates on this issue, or possible workarounds?

It could definitely be solved by removing the long-polling service and switching to just calling ECS RunTask (process one job and terminate) without waiting for the result, but that would require more changes to our application architecture, and it would also be more tightly coupled to ECS.
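
For reference, the RunTask-per-job variant would look roughly like this (a sketch assuming boto3, with placeholder cluster and task definition names):

    import boto3

    ecs = boto3.client("ecs")

    def run_one_job():
        # Start a one-off task that processes a single job and then exits on its own.
        # We would not wait for the result here.
        ecs.run_task(
            cluster="my-cluster",           # placeholder
            taskDefinition="my-worker:1",   # placeholder family:revision
            launchType="EC2",
            count=1,
        )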

thanks Martin

Alonreznik commented 5 years ago

Hi @coultn @adnxn Any update or ETA about that?

Thank you

Alon

Alonreznik commented 5 years ago

Hi guys. Can somebody take a look at this? It is harming our business because we have a problem deploying new versions to our prod. This is really problematic, and it casts a dark shadow over continuing to use ECS on our side.

Thanks

Alonreznik commented 5 years ago

@coultn

coultn commented 5 years ago

Hi, thank you for your feedback on this issue. We are aware of this behavior and are researching solutions. We will keep the github issue up to date as the status changes.

Alonreznik commented 5 years ago

Hi @coultn . Thanks for your reply.

We must say this is something that prevents our workloads from growing in line with our tasks, and there are situations where this behavior actually gets our production servers stuck. Again, this can be a no-go (or a no-continue, in our case) for using ECS in prod.

For example, you can see a typical production desired/running gap below. [graph: desired vs. running tasks]

The green layer is the gap between the desired tasks and the running tasks (orange layer). The blue is the PENDING tasks in the cluster. You can see a constant gap between these two. No deployment was made today; this is something we're encountering in the scale-up mechanism.

Think about the situation we're encountering. We have new jobs in our queue (SQS), and therefore we ask ECS to run new tasks (meaning the desired task count increases). Each job is a task in ECS, and all of them are split between the servers. When some job takes a while to complete (and there are many of them, because we ask each worker to finish its current job before it dies), that one job blocks the entire instance from receiving new ones, even when there are free resources on the instance.

The ECS agent schedules new tasks onto that instance and then hits the one task that is still working. From the agent's point of view it has done its job: it scheduled new tasks. But those tasks are stuck in the PENDING state, for hours in some cases, making the instance unusable because they're simply not running yet. Now imagine you need to launch another 100 tasks within a few hours to clear a backlog, and you have 5-6 instances each blocked by one task; it becomes a mess.

We should also say we've only encountered this in the last year, after an agent upgrade a year or a year and a half ago.

Every day we need to request more instances for our workloads in order to unblock them. This is not how a production service on AWS should have to be run, and we're facing it again and again, every day.

Please help us continue using ECS as our production orchestrator. We love this product and want it to succeed, but as it stands, it doesn't fit long-running tasks.

Your help in prioritizing this with your team would be much appreciated.

Thank you

Alon

Halama commented 5 years ago

I've discussed this with Nathan; he told me that they plan to fix this, but unfortunately there is no quick fix. We have similar issues with deployment and scaling, and because of them a lot of unnecessary rolling of new instances.

Meanwhile we are experimenting with EKS (also for multi-cloud deployment), where this issue isn't present.

Alonreznik commented 5 years ago

I've discussed this with Nathan; he told me that they plan to fix this, but unfortunately there is no quick fix. We have similar issues with deployment and scaling, and because of them a lot of unnecessary rolling of new instances.

Meanwhile we are experimenting with EKS (also for multi-cloud deployment), where this issue isn't present.

Hi @Halama .

Thanks for the reply and the update.

I understand this is not something that can be solved quickly, but meanwhile the ECS team could provide workarounds, such as a binpack placement strategy that prefers the newest instances, or a limit on how long a task can sit in the PENDING state before it is tried on other instances. This issue is not getting any response even though many users are encountering it. It has been open for more than a year and they can't give any reasonable ETA (even 3 months would be fine for us). It only moved to "researching" in the last week.

Can you please share more about your migration process from ECS to EKS?

Thanks again

Alon

coultn commented 5 years ago

@Alonreznik, would you be willing to share more details about your specific configuration via email? ncoult AT amazon.com. Thanks

Alonreznik commented 5 years ago

Hi @coultn. Thanks for the offer. I've just sent a detailed email with our architecture and configuration, and the problem we're facing.

Thanks, and appreciated!

Alon

Alonreznik commented 4 years ago

Hi guys. We've actually faced this again today in prod, where there was a huge gap for almost an hour between the desired tasks and the running tasks. This is not something we can rely on anymore, and it casts a big shadow over using ECS in production for our main system, because it means the PENDING tasks simply block the entire instance and we need to refresh the entire task placement every time it happens.

In the next graph, you can see the gaps:

[graph: desired vs. running task count]

The green is the desired task count per minute, and the orange is the actual number of running tasks. At one point, more than 70 tasks were asked to start and got stuck because of one (!!!) task still running on each instance. We also don't have the ability to place new tasks on new instances only, so there's nothing we can do about it.

This is a big gap in the service and makes it unstable in our view. Please fix it as soon as possible.

Alon

coultn commented 4 years ago

@Alonreznik Yes, this is a known issue (as we have discussed previously). Here is one solution that we are considering implementing soon:

  1. Introduce a new subject field for the ECS cluster query language, called stoppingTasksCount. This would be similar to the existing field called runningTasksCount, except that it would be the count of tasks on an instance that are in the STOPPING state.
  2. With this new subject field, you could use a placement constraint of this form:
    "placementConstraints": [
    {
        "expression": "stoppingTasksCount == 0",
        "type": "memberOf"
    }
    ]

This placement constraint would prevent the task from being placed on any instance that has a stopping task. So, new tasks would only be placed on instances with no stopping tasks. Please let us know if you have questions/comments about this proposed solution.
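
For example, with boto3 the constraint would be attached wherever placement constraints are normally supplied, such as when creating the service (placeholder names; note again that stoppingTasksCount is only the proposed field and does not exist yet):

    import boto3

    ecs = boto3.client("ecs")

    ecs.create_service(
        cluster="my-cluster",                 # placeholder
        serviceName="my-worker-service",      # placeholder
        taskDefinition="my-worker:1",         # placeholder
        desiredCount=10,
        placementConstraints=[
            {
                "type": "memberOf",
                # Proposed subject field; not yet part of the cluster query language.
                "expression": "stoppingTasksCount == 0",
            }
        ],
    )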

tom22222 commented 4 years ago

Hi @coultn Thanks for the update on this. How would this work in this example scenario:

I have 1 ecs cluster with 10 ec2 instances, ECS_CONTAINER_STOP_TIMEOUT is set to 6 hours. I have a service with 10 tasks that are distributed evenly over all 10 nodes (1 task per node). I tell the service to scale in to 0 tasks but the tasks are currently busy so they stay in a stopping state until they have finished their work (or 6 hours expires). I then try to create another service on the same cluster while these long running tasks are still completing but, despite there potentially being lots of cpu and memory available, the other service is not able to start? In my mind this still means the whole cluster is essentially locked for up to 6 hours. Or am I misunderstanding the proposed solution here?

I don't know the full details of how ECS works, but it seems more sensible to me to base the allocation of tasks to hosts on the actual free CPU and memory of each host (total CPU/RAM minus (tasks running + tasks stopping)). If there are tasks stopping on a host, you should still be able to start tasks on that host if there is sufficient resource; when the tasks finally stop, you tell ECS that the CPU/RAM has become available.
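
To spell the suggestion out (just an illustration of the proposed accounting, not how ECS behaves today):

    # Suggested accounting: stopping tasks still count against capacity,
    # but whatever is left over can be used for new tasks immediately.
    total_mem_mb = 30 * 1024                  # e.g. an instance registered with 30 GB
    running_mem_mb = [10 * 1024]              # reservations of RUNNING tasks
    stopping_mem_mb = [10 * 1024]             # reservations of tasks still stopping

    free_mem_mb = total_mem_mb - (sum(running_mem_mb) + sum(stopping_mem_mb))
    print(free_mem_mb // 1024, "GB free")     # 10 GB: a new 10 GB task could be placed now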

coultn commented 4 years ago

@tom22222 In your specific scenario, I would recommend (1) sizing the instances as close as possible to the task size (since you are only running 1 task per instance) and (2) enabling cluster auto scaling - this will cause your cluster to scale out to accommodate the new service automatically, assuming the placement constraint feature is implemented as I proposed above.

However, we may also implement additional changes to account for stopping tasks’ resources as you proposed, but at the present time the placement constraint approach will be quicker to launch. The reason for this is that the idea you proposed above would change the default behavior of task placement for all customers, even if they are not having the problem described in this issue. We typically approach changing default behavior for all customers more slowly than a feature that doesn’t change default behavior but which can be opted into for those customers that need it (such as the placement constraint approach I proposed above).

tom22222 commented 4 years ago

Hi @coultn

Thanks for the response. So on your suggestions:

I would recommend (1) sizing the instances as close as possible to the task size (since you are only running 1 task per instance)

I just gave that as an example of how one service could be configured; we could be running lots of other services with tasks of various sizes on the cluster at the same time, and the issue would still be the same.

(2) enabling cluster auto scaling

Yes, I agree, using the new Capacity Provider features along with your proposed change could work and should allow new services to be started. The drawbacks are that it would take some time to spin up additional EC2 instances to handle the new task requirements, and all of the existing EC2 instances would still be locked despite having plenty of CPU/RAM available (so we would have poor scaling performance across the whole cluster and would be wasting resources). Don't get me wrong, I do think this is better than what we have now, but it feels more like a workaround with some drawbacks.

We typically approach changing default behavior for all customers more slowly than a feature that doesn’t change default behavior but which can be opted into for those customers that need it

I completely agree there would be a risk in changing the default behaviour, so I would suggest having it as another option where the default is the current behaviour:

ECS_CONTAINER_STOPPING_TASK_PROCESSING=completed (this would be the current way it works)
ECS_CONTAINER_STOPPING_TASK_PROCESSING=immediate (this would allow other tasks to be started even if there were tasks in a stopping state)

(I'm sure you can think of better option names but hopefully you get my point)

Thanks

Halama commented 4 years ago

The issue also affects DAEMON services. Daemon containers don't start on a new instance when there are RUNNING daemon containers with desired status STOPPED on DRAINING instances. I think it is because the automatic desired count of a DAEMON service also counts RUNNING containers with desired status STOPPED.

rs-garrick commented 4 years ago

I'm also trying to take advantage of a long stopTimeout. Preventing new tasks from running only makes sense when the stopTimeout is only a few seconds. Now that stopTimeout can be very long, having stuck ECS hosts is silly.

I propose that tasks should not deregister from the service until after the container has actually exited. The task should enter a new state called STOPPING once the initial stop has been sent to the container. In this way the resources are still tracked and scheduling new tasks can be assigned appropriately. After the container exits, then the task can move into STOPPED and deregistered.

Alonreznik commented 4 years ago

Any update with this one?

Alonreznik commented 4 years ago

@coultn any update here? Is there anything that might be an option?

This is a 2-year-old issue!

thom-vend commented 3 years ago

Hi @pavneeta, any update on this issue?

AlexisConcepcion commented 3 years ago

Any update on this?

estoesto commented 3 years ago

I'm running 1 task per host, with autoscaling, but everything gets piled up in the MQ because of this one stopping task (which runs daily and should stop gracefully). CI/CD pipelines also fail, since I'm relying on aws ecs wait services-stable. The only workaround that works for me is to modify the capacity provider to run extra instances. What a waste.
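
The wait I'm referring to is the services-stable waiter; roughly, in boto3 terms, with placeholder names:

    import boto3

    ecs = boto3.client("ecs")

    # Equivalent of `aws ecs wait services-stable`: blocks until the service's
    # running count matches its desired count, or the waiter times out.
    ecs.get_waiter("services_stable").wait(
        cluster="my-cluster",        # placeholder
        services=["my-service"],     # placeholder
    )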

@coultn Your suggestion would solve it. Any ETA for this?

AlexisConcepcion commented 3 years ago

We recently implemented Datadog and cAdvisor as daemons for ECS using CloudFormation. We have more than 20 stacks, a few of them running about 10 instances (the bigger ones). On the first try, the daemons took about 5 hours to reach running. The key to improving this and getting the new daemon tasks running was to set MinClusterSize=1 (it was not previously defined) and the following placement strategy in ECS-Service.yaml (after those modifications we deployed the daemons).

 PlacementStrategies:
   - Type: spread
     Field: instanceId
   - Type: binpack
     Field: memory

We are planning to apply it in prod soon. Keep in mind that the placement strategy performs a rollout of your running instances. I don't think it is a solution, but it could help!
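
If you're not using CloudFormation, the same strategy can be supplied when the service is created; a rough boto3 equivalent with placeholder names:

    import boto3

    ecs = boto3.client("ecs")

    # Same spread-then-binpack strategy as the CloudFormation snippet above.
    ecs.create_service(
        cluster="my-cluster",            # placeholder
        serviceName="my-service",        # placeholder
        taskDefinition="my-task:1",      # placeholder
        desiredCount=2,
        placementStrategy=[
            {"type": "spread", "field": "instanceId"},
            {"type": "binpack", "field": "memory"},
        ],
    )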

Alonreznik commented 3 years ago

Any update on this? We love ECS, but this use case is driving us towards Kubernetes, which solves it easily.

Alonreznik commented 3 years ago

By the way: 3 years (!!!!) after this issue was opened, many people are still facing this unexpected behaviour. I think that's a good reason to finally fix it for good.

Alonreznik commented 2 years ago

Hi @petderek. Any update?

markdascher commented 2 years ago

AWS seems to be overly cautious regarding a fix, and I think it's because the issue still isn't clearly understood by everyone involved. I'm not entirely sure that I understand it myself, but after reading the whole thread, here's what it seems to boil down to:

  1. A 40 GB host has a single 10 GB task. It can start three more 10 GB tasks just fine. Everyone is happy.
  2. The same 40 GB host has the same 10 GB task, but now that task is stopping. Suddenly we can't start any new tasks on this host, even though there are 30 GB available.

Scenario 2 makes no sense. It's clearly a bug. The phrase "by design" doesn't belong in this thread. I understand how it could've happened though–it's perhaps an unfortunate workaround for an older bug:

Is that accurate? Are we actually worried about the unintended consequences of fixing Bug A?

In our case shortening stopTimeout isn't a viable option, and neither are placementConstraints. Every host may have tasks stopping at the same time, so placementConstraints would just continue making them all unusable. (And even in a best case, it would result in very suboptimal placement as everything gets squeezed onto a small number of usable hosts.)

Two possible fixes:

Alonreznik commented 1 year ago

Hi everyone. It seems this just won't be prioritized, and the ECS team is effectively saying "we're living with the bug", while this bug prevents so many users from doing BASIC things on ECS, such as simply running tasks that work. Can someone give it some attention?

AbhishekNautiyal commented 1 year ago

We are excited to share that we've addressed the known issue in the ECS agent that left tasks stuck in the PENDING state on instances that have stopping tasks with long timeouts. For details on the root cause, the fix, and other planned improvements, please see the What's New post, blog post, and documentation.

We'll be closing this issue. As always, happy to receive your feedback. Let us know if you face any other issues.

Alonreznik commented 1 year ago

Holy moly!!! 5 years!! Amazing guys! I'm so excited! Thank you so much 🙏🙏