[ECS] [request]: allow StopTask API to safely deregister and drain connections

zbintliff commented 4 years ago

Tell us about your request

Currently, when you call aws ecs stop-task, my understanding is that ECS tells the Agent to immediately send SIGKILL to the container. Ideally, we would like to mark a task to be killed and have ECS act as it does when a container instance drains. Those steps are:

Spin up new task and add to TG
Begin draining on old task (respecting the min % healthy threshold)
Issue SIGKILL to containers

Since this functionality is used for instance draining I hope it is something I have overlooked or will be easy to adopt.
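To make the sequence above concrete, here is a rough client-side sketch of the "drain, then stop" part, not something ECS does for you and not a confirmed workaround from this thread. It assumes awsvpc networking with an ip-type target group, and it takes the boto3 `ecs` and `elbv2` clients as arguments. The function name `drain_then_stop` and its parameters are hypothetical. It does not start a replacement task first; it relies on the service scheduler to replace the task after it stops, so it does not by itself respect the minimum healthy percent.

```python
import time


def drain_then_stop(ecs, elbv2, cluster, task_arn, target_group_arn, port,
                    poll_seconds=5, timeout_seconds=300):
    """Deregister one task's target, wait for draining to end, then stop the task.

    `ecs` and `elbv2` are expected to behave like boto3 clients for those
    services. Assumes the task uses awsvpc networking and that the target
    group registers targets by IP address.
    """
    # Find the private IP of the task's first container network interface.
    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    ip = task["containers"][0]["networkInterfaces"][0]["privateIpv4Address"]
    target = {"Id": ip, "Port": port}

    # Ask the load balancer to stop sending new requests to this target.
    elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=[target])

    # Poll until the target is no longer registered ("unused"), which is the
    # state reported once the deregistration delay has elapsed.
    deadline = time.monotonic() + timeout_seconds
    while True:
        health = elbv2.describe_target_health(
            TargetGroupArn=target_group_arn, Targets=[target]
        )
        state = health["TargetHealthDescriptions"][0]["TargetHealth"]["State"]
        if state == "unused":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"target {ip}:{port} still in state {state!r}")
        time.sleep(poll_seconds)

    # Only now stop the task; the service scheduler should start a replacement.
    ecs.stop_task(cluster=cluster, task=task_arn, reason="manual drain")
    return target
```

With real clients this would be called as `drain_then_stop(boto3.client("ecs"), boto3.client("elbv2"), ...)`. Passing the clients in keeps the sketch testable with stubs and makes the ordering (deregister, wait, stop) explicit.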

Which service(s) is this request for? Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

We have pretty large services (200+ tasks) that have a local cache and many connections to downstream resources. Sometimes an app gets into a bad JVM state where GCs happen more frequently than usual, and we want to safely mark one task for termination while staying above our deploymentConfiguration healthy percent. Currently, the only way to do so is either to drain the entire container instance or to run --force-new-deployment. A whole new deployment is an "expensive" process, and if we issue a StopTask we see an increase in 5xx errors because no connections are drained.
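As a small worked example of the "staying above healthy percent" constraint: the sketch below assumes the floor is desiredCount × minimumHealthyPercent / 100 rounded up, which is how I understand ECS to compute it during deployments. Whether a future drain API would apply the same rule is an assumption, and the function names are hypothetical.

```python
def min_healthy_tasks(desired_count, minimum_healthy_percent):
    """Smallest number of running tasks allowed, rounded up (integer math)."""
    return (desired_count * minimum_healthy_percent + 99) // 100


def can_stop_one(desired_count, running_count, minimum_healthy_percent):
    """True if stopping one task keeps the service at or above the floor."""
    floor = min_healthy_tasks(desired_count, minimum_healthy_percent)
    return running_count - 1 >= floor


# With 200 desired tasks and a 95% floor, 190 tasks must stay running,
# so one of 200 running tasks can be stopped without a replacement first.
print(can_stop_one(200, 200, 95))   # True

# With a 100% floor, a replacement has to be started before stopping anything.
print(can_stop_one(200, 200, 100))  # False
```

This is why the requested ordering matters: at a 100% floor, the new task has to be healthy in the target group before the old one begins draining.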

Are you currently working around this issue? Right now we are doing aws ecs update-service --force-new-deployment, but as I said it's "expensive" in time and resources.

Please let me know if you have any questions!

coultn commented 4 years ago

Thanks for the request! So if I understand correctly, the ask is specifically for tasks running as part of a service. You would like the ability to manually set an individual task in a service to a "draining" state, which will cleanly drain connections and deregister the task from the load balancer target group, and then initiate task shutdown; and also ensure that a replacement task is started up in accordance with the minimumHealthyPercent and maximumPercent parameters. Is that correct?

zbintliff commented 4 years ago

Exactly!

While we would love to have the application solve this at the health check level, sometimes the application is "healthy" but spending half its CPU cycles in garbage collection churn. This request has popped up internally quite often lately. On one hand we have StopTask, which is a "harsh" action that definitely results in increased 5xx errors, and on the other hand redeploying 100+ tasks in a service because of one bad task is expensive.

chenrui333 commented 3 years ago

While this feature request is still in the open state, is there any way to mitigate the issue for production operations? Thanks!

ohookins commented 2 years ago

Still hanging out for this as well! Would appreciate any good workarounds.

slayer commented 2 years ago

Any news? AWS ECS is totally painful 😖 without such a simple feature

jaredjstewart commented 1 year ago

@zbintliff Is this still necessary given https://github.com/aws/containers-roadmap/issues/708 having been resolved? It's not clear to me if there's any distinction between the two requests.

[ECS] [request]: allow StopTask API to safely deregister and drain connections #576