aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Add Readiness Checks #1670

Open dastbe opened 2 years ago

dastbe commented 2 years ago

Tell us about your request

Today in ECS, any healthcheck failure results in the immediate termination of the task. While this is theoretically desirable, in practice it can exacerbate outages rather than help. For example, a momentary blip in health across the fleet can lead to a minutes-long rotation of all tasks which, depending on what path ECS takes, can be disruptive to customers. Additionally, task replacement is a comparatively expensive process, requiring various provisioning systems to be online.

Contemporaries like Kubernetes have opted to make a distinction between "liveness" (whether a task should be kept running or replaced) and "readiness" (whether a task should be routed to). Having the ability to encode readiness checks as distinct from what exists today would help service owners configure their services to isolate tasks during periods of temporary instability without a full-blown replacement.
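To make the distinction concrete, here is a minimal sketch of how the two signals could map to different actions. The names `Action` and `decide` are hypothetical and only illustrate the idea; they are not an ECS or Kubernetes API.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"      # keep the task running and keep routing traffic to it
    ISOLATE = "isolate"  # keep the task running but stop routing traffic to it
    REPLACE = "replace"  # stop the task and start a new one


def decide(live: bool, ready: bool) -> Action:
    """Map liveness and readiness results to an action (hypothetical model)."""
    if not live:
        # A failed liveness check means the task cannot recover on its own.
        return Action.REPLACE
    if not ready:
        # Live but not ready: take it out of rotation, do not restart it.
        return Action.ISOLATE
    return Action.SERVE
```

Under this model a temporary dependency outage would produce `ISOLATE` for the affected tasks rather than a fleet-wide replacement.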

Caveat: one interesting aspect of this request is that some systems ECS integrates with have their own readiness checking built in, e.g. ELB. Any such system should also be able to change how ELB's signals are treated by ECS.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

See above

Are you currently working around this issue?

Not particularly. This is a fundamental behavior of ECS and not one you can easily work around.

geoffreywiseman commented 1 year ago

I imagine most people who read this are well aware of the reasons why this is desirable, but another example: restarting an ECS task can have other side-effects: an ECR download, ENI allocation. If liveness is fine and readiness is not, restarting ECS tasks adds a lot of noise and sometimes even costs to a system without merit. The task is live and restarting it doesn't change that.

If you're using ECS, try to make sure your ECS health checks are as close to a pure liveness check as possible -- that they do not rely on external resources. That doesn't mean that your service is "ready", but it's better than having ECS assume a task needs to be replaced because a dependency is down.
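As a rough sketch of that advice, the two checks could be kept as separate functions inside the application, with only the liveness one wired to the ECS health check. The function names and the `dependency_checks` mapping are assumptions for illustration, not part of any AWS SDK.

```python
from typing import Callable, Dict


def liveness_status() -> int:
    """Pure liveness: touches no external resource.

    If the process can execute this handler at all, it reports 200.
    """
    return 200


def readiness_status(dependency_checks: Dict[str, Callable[[], bool]]) -> int:
    """Readiness: 200 only if every dependency check passes, otherwise 503.

    `dependency_checks` is a hypothetical mapping from a dependency name
    to a callable that returns True when that dependency is usable.
    """
    for check in dependency_checks.values():
        try:
            if not check():
                return 503
        except Exception:
            # A check that raises is treated the same as a failed check.
            return 503
    return 200
```

With this split, a database outage changes only the readiness result, so the liveness check gives ECS no reason to replace the task.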

fierlion commented 1 year ago

https://github.com/aws/containers-roadmap/issues/1270 <- potentially related: "Target Group starts making health checks from the moment the target is registered. Hence, even if the service has HealthCheckGracePeriod configured, it is possible that the target is marked Unhealthy in the target group and the UnhealthyHostCount metric is updated."

"As a workaround, I can change the parameters of my health check to be either less frequent or require more consecutive failed checks before calling a target unhealthy, but that means that the health check will be that much slower at detecting any actual issues. Seems bizarre to me that the ECS health check has a grace period concept built in but the target group health checks don't."
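The trade-off in that workaround can be estimated with simple arithmetic: a target is marked unhealthy only after a number of consecutive failed checks, so detection time grows with both the interval and the threshold. The sketch below is a rough estimate that ignores check timeouts and jitter; the function name and the example values are assumptions.

```python
def approx_detection_seconds(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time to mark a target unhealthy.

    Assumes the target starts failing right after a check and that each
    failed check is observed one interval apart.
    """
    return interval_s * unhealthy_threshold


# Hypothetical settings: a tighter check versus a more tolerant one.
tight = approx_detection_seconds(interval_s=10, unhealthy_threshold=2)
tolerant = approx_detection_seconds(interval_s=30, unhealthy_threshold=5)
print(tight, tolerant)
```

Relaxing the check to ride out start-up noise therefore also delays detection of real failures by the same factor.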

nhlushak commented 1 year ago

My team and I have been looking forward to this feature for a long time. I totally agree with the other commenters and think the least the ECS team could give us is a toggle that would allow ignoring ELB health checks and taking no action on them. This would bring so much relief for teams that develop API services, because ELB by design does not deregister or kill unhealthy targets; it just stops sending traffic to them.

Speaking separately about the current ECS flow for mitigating unhealthy targets: it is strange to me that it does not trigger a rolling update, but rather just stops the unhealthy task and only then deals with Running count != Desired count.

allenbrubaker commented 1 year ago

Time to move to lambdas.

genbit commented 11 months ago

We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy. You can read more in this What's New post: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/ Blog post with a deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/

wdolek commented 9 months ago

@genbit The ECS improvements you linked are definitely great, but the lack of proper "readiness" still makes the warm-up phase quite cumbersome. I skimmed through the documentation but couldn't find anything related.

Our application requires 30 to 60 s to fully warm up (fetch content, process it, warm up the local cache). We don't want any request to land on an instance that is not fully warmed up; at the same time, our health check indicates whether the instance is in a correct state. To achieve this we resorted to adding an artificial delay via healthCheckGracePeriod (CDK, property of ApplicationLoadBalancedFargateService) as well as tweaking the health check's healthyThresholdCount and unhealthyThresholdCount properties.

This, however, doesn't prevent requests from reaching instances that are still warming up, causing requests to wait (cache locking) and perhaps time out when the client expects a faster response.
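For illustration, a separate readiness signal for this warm-up case might look like the sketch below: the instance reports live from the start but ready only once warm-up has finished. `WarmupGate` and its methods are hypothetical names, and nothing here reflects an existing ECS feature.

```python
import threading


class WarmupGate:
    """Tracks whether warm-up has completed (hypothetical helper)."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        """Call once content is fetched and the local cache is warm."""
        self._ready.set()

    def liveness_status(self) -> int:
        # The process is running, so it should not be replaced.
        return 200

    def readiness_status(self) -> int:
        # 503 until warm-up finishes, so a router could hold back traffic.
        return 200 if self._ready.is_set() else 503


gate = WarmupGate()
status_during_warmup = gate.readiness_status()
gate.mark_ready()
status_after_warmup = gate.readiness_status()
```

The point of the split is that the 503 during warm-up would only withhold traffic; it would not count against the task as a health failure.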

It would be beneficial to see a differentiation between readiness and liveness. (However, I have to admit we aren't considering taking on the overhead of K8s.)

JustinReshop commented 8 months ago

+1 to this request.

Killing tasks that are not capable of handling more traffic because they are too busy handling (possibly slow) other requests, rather than them being dead, is far from ideal and only exacerbates the problem.

fahd-sainsburys commented 3 months ago

+1 to this request

velhoi commented 2 months ago

+1 to this request