[ECS] Automatic DRAINING state on spot retirement

aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).

https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] Automatic DRAINING state on spot retirement #190

Closed tyrken closed 5 years ago

tyrken commented 6 years ago

Summary

Please auto-set a spot instance to DRAINING when it's being terminated

Description

When a spot instance is terminated (e.g. by scale down or price event), a 2-minute notification is given via a magic URL: http://169.254.169.254/latest/meta-data/spot/termination-time

I was hoping the ECS agent could monitor this & set the container instance state to DRAINING automatically. This allows slightly cleaner scale-downs than merely dropping containers as the instance powers off.

I see others have implemented local scripts or code to do this themselves, e.g. https://github.com/ktruckenmiller/aws-ecs-spot-instance-drainer
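
For illustration, here is a minimal sketch of what such an on-host watcher could look like. It is not the code from the linked project. It assumes the metadata URL answers 404 until a termination is scheduled and that the endpoint is reachable without an IMDSv2 session token; the function names (`fetch_termination_time`, `watch_and_drain`, `make_drainer`) are made up for this example. The ECS client is passed in, so in practice it would be something like `boto3.client("ecs")`.

```python
import time
import urllib.error
import urllib.request

# URL quoted in the issue description above.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=2.0):
    """Return the termination timestamp as a string, or None if none is scheduled.

    Assumption: the endpoint returns HTTP 404 while no termination is pending.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def make_drainer(ecs_client, cluster, container_instance_arn):
    """Build a zero-argument callable that sets the instance to DRAINING."""
    def drain():
        ecs_client.update_container_instances_state(
            cluster=cluster,
            containerInstances=[container_instance_arn],
            status="DRAINING",
        )
    return drain


def watch_and_drain(fetch, drain, poll_seconds=5.0, sleep=time.sleep):
    """Poll fetch() until it returns a timestamp, then call drain() exactly once."""
    while True:
        when = fetch()
        if when is not None:
            drain()
            return when
        sleep(poll_seconds)
```

On a real host this would be started with something like `watch_and_drain(fetch_termination_time, make_drainer(ecs, cluster, arn))`, where the cluster name and container instance ARN have to be obtained separately (for example from the agent's local introspection endpoint, which is an assumption here and not shown).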

Expected Behavior

Container Instance is set to DRAINING briefly before vanishing.

Observed Behavior

CI stays in ACTIVE until it disappears.

Environment Details

Using agent version 1.20.3 currently

mats16 commented 6 years ago

This is a good issue. But I think ECS should not care about the host instance layer, because ECS can run on other platforms too.

So I resolved it with a CloudWatch Event. This is not the best solution, but it is better. https://github.com/mats16/ecs-spot-deregister/blob/master/CloudFormation/ecs-spot-deregister.yaml
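
As a rough sketch of the event-driven approach (not taken from the linked template), a handler could react to the "EC2 Spot Instance Interruption Warning" event, look up the matching container instance and set it to DRAINING. The function names and the single, pre-configured `cluster` argument are assumptions for this example, and the ECS client is injected (for instance `boto3.client("ecs")` inside a Lambda).

```python
def find_container_instance(ecs_client, cluster, ec2_instance_id):
    """Return the ARN of the container instance backed by ec2_instance_id, or None."""
    token = None
    while True:
        kwargs = {"cluster": cluster}
        if token:
            kwargs["nextToken"] = token
        page = ecs_client.list_container_instances(**kwargs)
        arns = page.get("containerInstanceArns", [])
        if arns:
            described = ecs_client.describe_container_instances(
                cluster=cluster, containerInstances=arns
            )
            for instance in described.get("containerInstances", []):
                if instance.get("ec2InstanceId") == ec2_instance_id:
                    return instance["containerInstanceArn"]
        token = page.get("nextToken")
        if not token:
            return None


def handle_interruption(event, ecs_client, cluster):
    """Drain the container instance named in a Spot interruption warning event.

    Returns the drained container instance ARN, or None if the event is of a
    different type or the EC2 instance is not registered in the cluster.
    """
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    instance_id = event["detail"]["instance-id"]
    arn = find_container_instance(ecs_client, cluster, instance_id)
    if arn is None:
        return None
    ecs_client.update_container_instances_state(
        cluster=cluster, containerInstances=[arn], status="DRAINING"
    )
    return arn
```

A Lambda entry point would then be a thin wrapper, e.g. `lambda_handler(event, context)` calling `handle_interruption(event, ecs, cluster_name)` with the cluster name read from configuration.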

tyrken commented 6 years ago

I like the idea of the Lambda you link to - but I'm not sure from reading the docs whether that new CloudWatch Event gets triggered for all cases of termination, e.g. when done manually via AWS Console or EC2 API call to terminate instances or modify/cancel the spot fleet request.

I see you earlier did a conventional on-host URL watcher in https://github.com/mats16/ecs-spot-agent - other than architectural beauty, is there any other reason for the re-implementation?

Either way I know these work-around/add-ons exist & am likely to implement one, but it would be nicer for me & others if it arrived as an ECS Agent feature, so it was already there by default.

coultn commented 5 years ago

Thanks everyone for the feedback on this issue. I wanted to let you know that we on the ECS team are aware of this issue, and that it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.

francesco-cambiaso commented 5 years ago

pedrini77 commented 5 years ago

sakuya9t commented 5 years ago

orsigiorgio commented 5 years ago

nikoizs commented 5 years ago

dancallan commented 5 years ago

devgrok commented 5 years ago

We had an issue at the end of last year where AWS Batch jobs were failing - Batch would put a job on an ECS host, the host would be terminated, and the job would keep being assigned to hosts only for them to be terminated too. The jobs seemed to either get assigned to a 'good' host or be continually assigned to hosts which were terminated until the retries were used up.

(I didn't have all of the necessary data dumped nor the time to fully investigate, but....) My theory was that the ECS agent would accept then fail jobs as it was trying to shut down gracefully, and I wanted to implement a spot termination listener job (i.e. as described above) on the hosts to see if that was the cause.

Though I ended up just working around it by increasing retries, decreasing the max cluster/compute environment size and adding more instance types to reduce server churn.

rwolfson commented 5 years ago

sumitverma commented 5 years ago

MarcusNoble commented 5 years ago

It'd be good to see this in EKS as well if possible.

ekini commented 5 years ago

The old way of polling the URL in a loop works, but an event-based Lambda is much better, just because it doesn't consume resources when it's not required, and one Lambda can handle multiple clusters. Fits into the free tier as well.

Here is another implementation: https://github.com/springload/spotasaurus, a Lambda packaged as a terragrunt module.

We can see only half of the control plane here, which is ecs-agent. But there is ECS cluster software running somewhere on AWS, and that would be the best place to implement handling of spot termination.

coultn commented 5 years ago

This feature will work similarly to https://github.com/aws/containers-roadmap/issues/256, except that ECS cannot prevent a Spot instance from terminating.

alvarow commented 5 years ago

saurabtanej commented 5 years ago

rdawemsys commented 5 years ago

orsigiorgio commented 5 years ago

barakseri1 commented 5 years ago

pavneeta commented 5 years ago

Hi everyone, today, September 27th, Amazon ECS launched Automated Draining for Spot Instances running ECS Services. This feature enables ECS customers to safely manage interruptions of ECS services running on Spot instances when the underlying EC2 Spot instance is terminated. https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-ecs-supports-automated-draining-for-spot-instances-running-ecs-services/

innokentiyt commented 5 years ago

Any news on this? https://github.com/aws/containers-roadmap/issues/256