Closed nexus49 closed 7 years ago
It sounds like what you want is to have a 'draining' state on a container instance. I'll take this as a feature request. I don't think there's any better way than you outlined above to do that right now.
Thanks for the suggestion!
I'd like to see this as well. I'm investigating the usage of Spot instances in an ECS cluster (whether Spot instances should be used in an ECS cluster or not is probably a different question...), and it would be very useful to stop tasks gracefully on containers that are about to be terminated.
I would also like to see that functionality. One use case is that you want to change instance type or AMI without interruption.
+1!
+1!
:+1: This would be huge for us as well; it's a pretty tedious task if you have to maintain QoS.
:+1:
Same for our project. We found no simple way to temporarily deregister an instance for maintenance or container upgrades; we had to script a lot. Having that in a single CLI command would make ECS perfect.
+1
Or at the minimum if we can have a way to put the ecs-agent in a mode where it doesn't accept new ecs tasks. Then we can gradually drain the existing tasks using the standard ecs API. If the tasks were part of a service, they would be automatically started on some other instance.
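The manual drain described above can be sketched roughly as follows. This is purely illustrative: the `ecs` argument stands in for any boto3-style ECS client, and the function name is hypothetical; the only real API calls assumed are `list_tasks` and `stop_task`.

```python
# Hypothetical sketch of draining an instance via the standard ECS API:
# stop each task on the instance and let the service scheduler replace
# it elsewhere. `ecs` is any boto3-style ECS client.

def drain_instance_tasks(ecs, cluster, container_instance_arn):
    """Stop every task on one container instance."""
    task_arns = ecs.list_tasks(
        cluster=cluster,
        containerInstance=container_instance_arn,
    )["taskArns"]
    for arn in task_arns:
        # Tasks that belong to a service are restarted on another
        # instance by the service scheduler after they stop.
        ecs.stop_task(cluster=cluster, task=arn,
                      reason="instance drain for maintenance")
    return task_arns
```

This only works gracefully if the agent stops placing new tasks on the instance in the meantime, which is exactly the missing piece this comment asks for.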
Do you guys have any input if this will make it into the product or when?
Sorry, but as a general rule we typically don't comment on timelines. We'll update this issue as we are able to.
:+1:
+1
:+1:
+1, especially if it integrates with EC2 Autoscale Lifecycle hooks so that a node that is terminating has an opportunity to gracefully drain tasks away.
Copying my comment from #210 because I think it is a nice way to solve this:
Is the list of attributes used during instance registration customizable using ecs.config? Quick use-case: tag instances with a version attribute (eg. my-container-instance-v3) and have new tasks target only those instances via requiredAttributes: ["my-container-instance-v3"].
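For the attribute side of that use-case, a sketch of the agent configuration, assuming the agent supports custom attributes via the `ECS_INSTANCE_ATTRIBUTES` variable (the attribute name and value here are illustrative):

```
# /etc/ecs/ecs.config -- illustrative; assumes agent support for
# custom attributes via ECS_INSTANCE_ATTRIBUTES
ECS_CLUSTER=my-cluster
ECS_INSTANCE_ATTRIBUTES={"deploy-generation": "my-container-instance-v3"}
```

Task definitions could then constrain placement to instances carrying that attribute, so rolling to a new generation of instances steers new tasks away from the old ones.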
+1 on this as well
If you're using ECS while running a custom AMI and ever want to perform updates, this is pretty painful right now.
I think the desired behavior that we would like to see would be something along the lines of this: sending SIGTERM to the ECS agent would disable the task engine from processing new tasks, and start killing the current tasks that it's managing, releasing them back to ECS to be re-scheduled. If we got both no. 1 and no. 2 above, then we could use lifecycle hooks with ASG to kill hosts as quickly as possible, while ensuring that all ECS tasks are healthy. I realize that no. 2 is probably difficult, or unrealistic, but I think even having no. 1 would be a big win.
Sorry, but as a general rule we typically don't comment on timelines. We'll update this issue as we are able to.
@euank can you tell us if you are considering to implement such a feature?
@seiffert I can't say since I no longer work on ECS (nor at Amazon). I don't want to put words into the mouth of the team, but what I said before is probably still true (that this issue will be updated if/when there's anything worth saying).
Cheers, Euan
Does this (workaround) process achieve the results we need? Our current workaround to avoid outages in production during a cluster instance rolling update uses a lambda that calls DeregisterContainerInstance to remove the instance we're destroying from the cluster.
This is generally enough to allow ECS to reschedule any tasks that are part of a service to another instance. The only tasks we run that are not part of a service are launched during instance creation and tied to that instance (e.g. cadvisor) so for us the fact those tasks die with the instance isn't an issue.
Downsides:
At the very least, we'd love to see a feature in ECS where we could poll ECS after deregistering an instance to determine when all service tasks from that instance have been successfully rescheduled elsewhere.
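The polling we'd like could look roughly like this: record which services had tasks on the instance, deregister it, then wait until each of those services is back at its desired count. A minimal sketch, assuming a boto3-style ECS client passed in as `ecs`; the function name, timeouts, and helper parameters are all hypothetical.

```python
import time

def deregister_and_wait(ecs, cluster, container_instance_arn,
                        timeout=1200, interval=15, sleep=time.sleep):
    """Deregister an instance, then poll until its services are stable."""
    task_arns = ecs.list_tasks(
        cluster=cluster,
        containerInstance=container_instance_arn)["taskArns"]
    tasks = (ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
             if task_arns else [])
    # A task's "group" is "service:<name>" for service-managed tasks.
    services = {t["group"].split(":", 1)[1]
                for t in tasks if t["group"].startswith("service:")}
    ecs.deregister_container_instance(
        cluster=cluster,
        containerInstance=container_instance_arn,
        force=True)
    deadline = time.monotonic() + timeout
    while services and time.monotonic() < deadline:
        described = ecs.describe_services(
            cluster=cluster, services=sorted(services))["services"]
        # Drop services that are back at their desired count.
        services -= {s["serviceName"] for s in described
                     if s["runningCount"] >= s["desiredCount"]}
        if services:
            sleep(interval)
    return not services  # True if everything was rescheduled in time
```

Note that a service can briefly report running == desired before the old tasks are replaced, so a real implementation would want something closer to a services-stable waiter; this sketch just shows the shape of the polling.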
One small improvement on the above, and to avoid an arbitrary 20 minutes, is to capture the running tasks at the time of deregistration and initialise a wait for the services to be stable again.
Just tried doing the above (manually), and it didn't work out. After I deregistered the container instance, ECS deregistered its containers from the ELB before starting new ones, which may be unsafe for some (and certainly if you're doing your entire fleet in parallel).
@mattcallanan would you mind sharing your lambda function code?
Hi @mattcallanan, this sounds pretty similar to what the article The Hook, the Message and the Function describes. We're using this setup in production very successfully.
@schickling deregister_from_ecs_cluster() in the article @seiffert links to is very similar to our approach, which in conjunction with the ASG heartbeat worked well: orphaned tasks could still serve traffic behind an ELB and we avoided outages.
But... in recent times, we've noticed that when an instance is deregistered from the cluster, the containers are automatically deregistered from the ELBs for their respective Services. The ECS documentation has been updated to reflect this: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/deregister_container_instance.html
Any containers in orphaned service tasks that are registered with a Classic Load Balancer or an Application Load Balancer target group are deregistered, and they will begin connection draining according to the settings on the load balancer or target group.
Unfortunately this means that outages are possible during cluster updates (depending on how many tasks there are per service and how quickly tasks can be launched on their ultimate new destination instances, etc.). For now, we're waiting on the instance draining feature mentioned on this thread. Having said that, suspending the termination for 20mins with the ASG lifecycle heartbeat is still useful to give containers a little extra time before the instance is terminated while they're being deregistered from their ELBs.
Today we launched Container Instance Draining to address this use case.
Container instance draining is great news. @cbbarclay, how do we use this in CloudFormation? Let's say I have an autoscaling group with a launch configuration for container instances. If I change the AMI in the launch configuration and update the CloudFormation stack, then I would expect it to drain the container instances before replacing them, but it doesn't do that.
@keycore-ho I'm trying to put together a plan for this, myself. The pattern I've mapped out so far is a variant of The Hook, the Message and the Function, where the lambdas would put the instance in a draining state, rather than deregistering it from the ELB and the ECS cluster.
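The lambda variant described here can be sketched like so: on an autoscaling termination lifecycle hook, set the instance to DRAINING, wait until nothing is running on it, then complete the hook. Clients are passed in boto3-style so this stays a sketch; the real API calls assumed are `update_container_instances_state`, `describe_container_instances`, and `complete_lifecycle_action`, while the function name and parameters are illustrative.

```python
import time

def drain_on_lifecycle_hook(ecs, autoscaling, cluster,
                            container_instance_arn, hook_name, asg_name,
                            instance_id, poll=15, sleep=time.sleep):
    """Drain an instance on an ASG termination hook, then let it die."""
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING")
    while True:
        ci = ecs.describe_container_instances(
            cluster=cluster,
            containerInstances=[container_instance_arn],
        )["containerInstances"][0]
        if ci["runningTasksCount"] == 0:
            break
        sleep(poll)
    # Nothing left on the instance: let the ASG proceed with termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id)
```

A real lambda would also need to re-send the heartbeat (or re-invoke itself) if draining outlasts the hook timeout, which this sketch omits.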
Folks, we have an update for that blog post we'll be publishing shortly. Since its writing, the ECS API changed which has made some of the old way we accomplished this obsolete.
Could always just use a pre-stop script in the docker init, like upstart supports. That's a bit simpler than all the lifecycle/lambda stuff.
Basically edit this after the docker install to add a pre-stop script: https://github.com/docker/docker/blob/master/contrib/init/upstart/docker.conf
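The addition would be a stanza along these lines, assuming an upstart-managed docker; the drain script path is hypothetical:

```
# appended to /etc/init/docker.conf -- illustrative only
pre-stop script
    # give containers a chance to deregister / finish in-flight work
    /usr/local/bin/drain-ecs-tasks.sh || true
end script
```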
@danbf yeah, that's the route I initially thought, but I'm using ECS optimized AMIs with an ancient version of upstart (which it appears is the mechanism the ecs-agent is run) and docker running under SysV init. I think getting it right with that combo might be tricky.
Aaaand, like clockwork, AWS publishes the blogpost with the variant I described above. It would be cool to see a tighter ECS/Autoscaling integration that just does this, but until that day, this works for me.
https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
@acmcelwee my previous comment no longer works. Now when I trigger a reboot, the docker pre-stop stuff happens as expected, but if I issue a terminate it immediately seems to drop the node from ECS, which then kills the ELB registrations even with a docker upstart script with a pre-stop clause.
It seems all roads are moving to ASG-lifecycle-only usage, unless I override the terminate lifecycle hook, which seems like not a great idea.
Hi, currently there does not seem to be a way to remove a container instance that runs tasks from a cluster without downtime, short of a number of steps that have to happen manually to accomplish the same goal.
It would be really helpful if one could mark a container instance as to-be-deregistered and ECS would take care of moving its tasks to other nodes before taking the container instance out of the cluster.
Please let me know if I missed something and there is already a way.
Manually you'd have to do:
Sometimes you just want to replace all nodes in your cluster because the ecs/docker versions changed or some other change that you want to roll out without service interruption.
Cheers