tyrken closed this issue 5 years ago
This is a good issue, but I think ECS should not be concerned with the host instance layer, because ECS can run on other platforms.
So I resolved it with a CloudWatch Event. This is not the best solution, but it is an improvement. https://github.com/mats16/ecs-spot-deregister/blob/master/CloudFormation/ecs-spot-deregister.yaml
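A minimal sketch of what such a Lambda might do. To be clear, this is not the linked template's code: the handler name, the hypothetical `cluster` parameter, and the boto3 calls sketched in comments are my assumptions; only the event's `detail-type` and `detail.instance-id` fields follow the documented Spot interruption warning shape.

```python
# Hypothetical sketch of a Lambda handling an "EC2 Spot Instance Interruption
# Warning" CloudWatch Event. Not the linked project's actual implementation.

def instance_id_from_event(event):
    """Pull the EC2 instance ID out of a Spot interruption warning event."""
    assert event.get("detail-type") == "EC2 Spot Instance Interruption Warning"
    return event["detail"]["instance-id"]


def handler(event, context, cluster="my-cluster"):  # "my-cluster" is a placeholder
    instance_id = instance_id_from_event(event)
    # A real Lambda would now map the EC2 instance ID to the ECS container
    # instance and drain it, roughly (via boto3, sketched as comments):
    #   ecs = boto3.client("ecs")
    #   arns = ecs.list_container_instances(
    #       cluster=cluster,
    #       filter=f"ec2InstanceId == '{instance_id}'")["containerInstanceArns"]
    #   ecs.update_container_instances_state(
    #       cluster=cluster, containerInstances=arns, status="DRAINING")
    return instance_id
```

Invoked with a sample interruption event, `handler` returns the instance ID it would have drained.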
I like the idea of the Lambda you link to - but I'm not sure from reading the docs whether that new CloudWatch Event gets triggered for all cases of termination, e.g. when done manually via AWS Console or EC2 API call to terminate instances or modify/cancel the spot fleet request.
I see you earlier implemented a conventional on-host URL watcher in https://github.com/mats16/ecs-spot-agent - other than architectural elegance, was there any other reason for the re-implementation?
Either way, I know these work-arounds/add-ons exist and am likely to implement one, but it would be nicer for me and others if this arrived as an ECS Agent feature, so it was available by default.
Thanks everyone for the feedback on this issue. I wanted to let you know that we on the ECS team are aware of this issue, and that it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.
+1
+1
+1
+1
+1
+1
We had an issue at the end of last year where AWS Batch jobs were failing: Batch would place a job on an ECS host, the host would be terminated, and the job kept being reassigned to hosts that were terminated in turn. Jobs either landed on a 'good' host or were repeatedly assigned to terminated hosts until the retries were used up.
(I didn't have all of the necessary data dumped, nor the time to fully investigate, but...) My theory was that the ECS agent would accept and then fail jobs while it was trying to shut down gracefully, and I wanted to implement a spot termination listener (i.e. as described above) on the hosts to see if that was the cause.
In the end I just worked around it by increasing retries, decreasing the maximum cluster/compute environment size, and adding more instance types to reduce server churn.
+1
+1
It'd be good to see this in EKS as well if possible.
The old way of polling the URL in a loop works, but an event-based Lambda is much better, simply because it doesn't consume resources when they aren't needed, and one Lambda can handle multiple clusters. It fits into the free tier as well.
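For reference, the event-based approach hinges on a CloudWatch Events (EventBridge) rule matching the Spot interruption warning; the rule pattern looks along these lines (worth verifying against the current EC2 events documentation):

```json
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
```

The rule's target is then the draining Lambda, which receives the matched event (including the interrupted instance's ID) as its payload.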
Here is another implementation: https://github.com/springload/spotasaurus - a Lambda packaged as a Terragrunt module.
We can only see half of the control plane here, namely the ECS agent. But there is also ECS cluster software running somewhere inside AWS, and that would be the best place to implement handling of spot termination.
This feature would work similarly to https://github.com/aws/containers-roadmap/issues/256, except that ECS cannot prevent a Spot instance from terminating.
+1
+1
+1
+1
+1
Hi everyone, today, September 27th, Amazon ECS launched Automated Draining for Spot Instances running ECS Services. This feature enables ECS customers to safely manage interruptions of ECS services running on Spot instances due to termination of the underlying EC2 Spot instance. https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-ecs-supports-automated-draining-for-spot-instances-running-ecs-services/
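For anyone picking this up: the announced behavior is opt-in via an agent configuration variable set on the container instance (check the current ECS agent docs for supported agent versions). On the instance it looks roughly like this; the cluster name is a placeholder:

```
# /etc/ecs/ecs.config
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
ECS_CLUSTER=my-cluster
```

With this set, the agent watches for the Spot interruption notice itself and places the container instance into DRAINING, which is exactly the behavior this issue requested.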
Any news on this? https://github.com/aws/containers-roadmap/issues/256
Summary
Please auto-set a spot instance to DRAINING when it's being terminated
Description
When a spot instance is terminated (e.g. by a scale-down or price event), a 2-minute warning is given via a magic instance-metadata URL: http://169.254.169.254/latest/meta-data/spot/termination-time
I was hoping the ECS agent could monitor this and set the container instance state to DRAINING automatically. This would allow slightly cleaner scale-downs than merely dropping containers as the instance powers off.
I see others have implemented local scripts or code to do this themselves, e.g. https://github.com/ktruckenmiller/aws-ecs-spot-instance-drainer
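The on-host watchers linked above all follow the same basic shape. A minimal illustrative sketch (not any linked project's code): poll the metadata endpoint, where a 200 response means a termination notice has been issued, then drain.

```python
import time
import urllib.error
import urllib.request

# Spot termination notice endpoint (instance metadata; only reachable on EC2).
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def termination_imminent(url=TERMINATION_URL, timeout=2):
    """Return True if the endpoint answers 200, i.e. a notice was issued."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # 404 ("no notice yet") raises HTTPError (a URLError subclass);
        # off-EC2 the host is unreachable. Either way: no notice.
        return False


def watch(poll_seconds=5):
    """Poll until a notice appears, then drain this container instance."""
    while not termination_imminent():
        time.sleep(poll_seconds)
    # Here a real watcher would call the ECS API (e.g. the AWS CLI or boto3's
    # update_container_instances_state) to set this instance to DRAINING.
```

Off EC2 the metadata address is unreachable, so `termination_imminent` simply reports False rather than raising.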
Expected Behavior
Container Instance is set to DRAINING briefly before vanishing.
Observed Behavior
The container instance stays ACTIVE until it disappears.
Environment Details
Using agent version 1.20.3 currently