Functionality to remove one or many container instance(s) that runs tasks without service interuption

nexus49 commented 9 years ago

Hi, currently there does not seem to be a way to remove a container instance that runs tasks from a cluster without downtime without a number of steps that have to happen manually to accomplish the same goal.

It would be really helpful if one could mark a container instance as to be deregistered and ecs would take care of reorganizing the tasks first to other nodes before taking the container instance out of the cluster.

Please let me know if i missed something and there is already a way.

Manually you'd have to do:

start task that is running on node x on another node
wait for that task to be up and registered to the load balancer
remove instance from load balancer and drain connections
stop task on node x
deregister container instance from cluster or terminate the instance directly

Sometimes you just want to replace all nodes in your cluster because the ecs/docker versions changed or some other change that you want to roll out without service interruption.

Cheers

euank commented 9 years ago

It sounds like what you want is to have a 'draining' state on a container instance. I'll take this as a feature request. I don't think there's any better way than you outlined above to do that right now.

Thanks for the suggestion!

keichan34 commented 9 years ago

I'd like to see this as well. I'm investigating the usage of Spot instances in an ECS cluster (whether Spot instances should be used in an ECS cluster or not is probably a different question...), and it would be very useful to stop tasks gracefully on containers that are about to be terminated.

2ndalpha commented 9 years ago

I would also like to see that functionality. One use case is that you want to change instance type or AMI without interruption.

pikeas commented 9 years ago

+1!

radenui commented 9 years ago

+1!

tj commented 9 years ago

:+1: this would be huge for us as well it's a pretty tedious task if you have to maintain QoS

mthenw commented 9 years ago

:+1:

antoine-galataud commented 9 years ago

Same for our project. We found no simple way to temporarily de register an instance for maintenance or containers upgrade, we had to script a lot. Having that in a single cli command would make ECS perfect.

maliksalman commented 9 years ago

+1

Or at the minimum if we can have a way to put the ecs-agent in a mode where it doesn't accept new ecs tasks. Then we can gradually drain the existing tasks using the standard ecs API. If the tasks were part of a service, they would be automatically started on some other instance.

nexus49 commented 9 years ago

Do you guys have any input if this will make it into the product or when?

euank commented 9 years ago

Sorry, but as a general rule we typically don't comment on timelines. We'll update this issue as we are able to.

bpascard commented 9 years ago

:+1:

bcwp commented 9 years ago

+1

tmornini commented 9 years ago

:+1:

jhmartin commented 9 years ago

+1, especially if it integrates with EC2 Autoscale Lifecycle hooks so that a node that is terminating has an opportunity to gracefully drain tasks away.

joostdevries commented 9 years ago

Copying my comment from #210 because I think it is a nice way to solve this:

Is the list of attributes used during instance registration customizable using ecs.config?

Quick use-case:

I've got a webapp running as service.
I want to change something at the container instance level (eg. logging, user accounts)
My ideal path:
- Create new ec2 instances and provision them.
- Register the new instances to the ecs cluster and give them a custom attribute (eg. my-container-instance-v3)
- Register a new task definition with requiredAttributes: ["my-container-instance-v3"]
- Update the service to the new task def
- The scheduler will now start the task on the new instances and once complete, I can terminate the old ones.

ejholmes commented 8 years ago

+1 on this as well

If you're using ECS while running a custom AMI and ever want to perform updates, this is pretty painful right now.

I think the desired behavior that we would like to see would be something along the lines of this:

Send a SIGTERM to the ECS agent. This would disable the task engine from processing new tasks, and start killing the current tasks that it's managing, releasing them back to ECS to be re-scheduled.
Ideally, block until all of the tasks that it was managing have been successfully placed onto a new host.

If we got both no. 1 and no. 2 above, then we could use lifecycle hooks with ASG to kill hosts as quickly as possible, while ensuring that all ECS tasks are healthy. I realize that no. 2 is probably difficult, or unrealistic, but I think even having no. 1 would be a big win.

seiffert commented 8 years ago

Sorry, but as a general rule we typically don't comment on timelines. We'll update this issue as we are able to.

@euank can you tell us if you are considering to implement such a feature?

euank commented 8 years ago

@seiffert I can't say since I no longer work on ECS (nor at Amazon). I don't want to put words into the mouth of the team, but what I said before is probably still true (that this issue will be updated if/when there's anything worth saying).

Cheers, Euan

igrayson commented 8 years ago

Does this (workaround) process achieve the results we need?

Call DeregisterContainerInstance to remove the instance we're destroying from the cluster.
Wait for ECS to allocate that task instance to another node.
Destroy the instance.

mattcallanan commented 8 years ago

Our current workaround to avoid outages in production during a cluster instance rolling update uses a lambda:

The lambda is triggered by AutoScaling Instance Terminating SNS events
The lambda deregisters the instance from the ECS cluster
The lambda also sends a heartbeat to the ASG to keep the instance in Terminating:Wait state for 20mins

This is generally enough to allow ECS to reschedule any tasks that are part of a service to another instance. The only tasks we run that are not part of a service are launched during instance creation and tied to that instance (e.g. cadvisor) so for us the fact those tasks die with the instance isn't an issue.

Downsides:

Tasks can get rescheduled to another old instance in the ASG that is about to be replaced - so tasks can get bumped from instance to instance until all instances are replaced
20mins is a long time for old containers to still be registered in the services' ELBs. Any deploys during that time can cause confusion around why old and new versions of service are running behind ELB.

At the very least, we'd love to see a feature in ECS where we could poll ECS after deregistering an instance to determine when all service tasks from that instance have been successfully rescheduled elsewhere.

tim-faase commented 8 years ago

One small improvement on the above and to avoid an arbitrary 20mins is to capture running tasks at the time of deregistration and initalise a waits on service to be stable again.

igrayson commented 8 years ago

Just tried doing the above (manually), and it didn't work out. After I deregistered the container instance, ECS deregistered (from ELB) the container instance before starting a new one, which may be unsafe for some (and certainly if doing your entire fleet in parallel)

schickling commented 7 years ago

@mattcallanan would you mind sharing your lambda function code?

seiffert commented 7 years ago

Hi @mattcallanan, this sounds pretty similar to what the article The Hook, the Message and the Function describes. We're using this setup in production very successfully.

mattcallanan commented 7 years ago

@schickling deregister_from_ecs_cluster() in the article @seiffert links to is very similar to our approach which in conjunction with ASG heartbeat worked well for orphaned tasks still being able to serve traffic behind an ELB and avoid outages.

But... in recent times, we've noticed that when an instance is deregistered from the cluster, the containers are automatically deregistered from the ELBs for their respective Services. The ECS documentation has been updated to reflect this: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/deregister_container_instance.html

Any containers in orphaned service tasks that are registered with a Classic Load Balancer or an Application Load Balancer target group are deregistered, and they will begin connection draining according to the settings on the load balancer or target group.

Unfortunately this means that outages are possible during cluster updates (depending on how many tasks there are per service and how quickly tasks can be launched on their ultimate new destination instances, etc.). For now, we're waiting on the instance draining feature mentioned on this thread. Having said that, suspending the termination for 20mins with the ASG lifecycle heartbeat is still useful to give containers a little extra time before the instance is terminated while they're being deregistered from their ELBs.

cbbarclay commented 7 years ago

Today we launched Container Instance Draining to address this use case.

ghost commented 7 years ago

Container instance draining is great news. @cbbarclay, how do we use this in CloudFormation? Let's say I have an autoscaling group with a launch configuration for container instances. If I change the AMI in the launch configuration and update the CloudFormation stack, then I would expect it to drain the container instances before replacing them, but it doesn't do that.

acmcelwee commented 7 years ago

@keycore-ho I'm trying to put together a plan for this, myself. The pattern I've mapped out so far is a variant of The Hook, the Message and the Function, where the lambdas would put the instance in a draining state, rather then deregistering from the elb and the ecs cluster.

kevinkarwaski commented 7 years ago

Folks, we have an update for that blog post we'll be publishing shortly. Since its writing, the ECS API changed which has made some of the old way we accomplished this obsolete.

danbf commented 7 years ago

could always just use a pre-stop script in the docker init, like upstart supports. that's a bit simpler then all the lifecycle/lamba stuff.

basically edit this after the docker install to add a pre-stop script: https://github.com/docker/docker/blob/master/contrib/init/upstart/docker.conf

acmcelwee commented 7 years ago

@danbf yeah, that's the route I initially thought, but I'm using ECS optimized AMIs with an ancient version of upstart (which it appears is the mechanism the ecs-agent is run) and docker running under SysV init. I think getting it right with that combo might be tricky.

acmcelwee commented 7 years ago

Aaaand, like clockwork, AWS publishes the blogpost with the variant I described above. It would be cool to see a tighter ECS/Autoscaling integration that just does this, but until that day, this works for me.

https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/

danbf commented 7 years ago

@acmcelwee my previous comment no longer works. now when I trigger a reboot, the docker pre-stop stuff happens as expected. but if I issue a terminate then it immediately seems to drop the node from ECS, which then kills the ELB registrations even with a docker upstart script with a pre-stop clause.

it seems all roads are moving to ASG lifecycle only usage. unless I override the terminate lifecycle hook which seems kinda not a great idea.

aws / amazon-ecs-agent

Functionality to remove one or many container instance(s) that runs tasks without service interuption #130