[ECS] [request/bug?]: Add scale in protection on hosts running a task via RunTask

eriko-de commented 3 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request: When using an ECS cluster with an EC2 Auto Scaling group as capacity provider, the EC2 instance on which a task started via the RunTask action gets placed should be protected from scale-in. Those instances are marked as protected from scale-in, but still get stopped when the Auto Scaling group scales in.

Which service(s) is this request for? ECS with EC2 ASG as capacity provider

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We are using ECS to manage our background processors (a Rails app with Resque workers running inside a task). We have two kinds of jobs: short-running, interruptible or repeatable jobs (image processing) and long-running, non-interruptible jobs (video processing and streaming). The short-running jobs are managed by an ECS service.

Because we can't tell an ECS service which tasks should be stopped on scale-in, we had to implement our own 'scaling logic' for the long-running jobs. We use RunTask to schedule new workers, and the tasks stop themselves when scaling down.

Bug or unexpected behavior: When we start a new task via the RunTask action, we would expect the instance on which the task is started to be marked as protected from scale-in, but it isn't.

Are you currently working around this issue? We manually observe the task count and start additional tasks if any task was stopped because the underlying EC2 host was terminated.
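The workaround described above (watch the task count, start replacements) could be automated roughly as follows. This is only a sketch, not the reporter's actual code: it assumes the long-running workers are launched with a common `startedBy` value so they can be counted, and that the cluster has a default capacity provider strategy so `RunTask` needs no explicit placement arguments. The function names and parameters are hypothetical; `ecs` is expected to behave like a boto3 ECS client.

```python
def count_running_tasks(ecs, cluster, started_by):
    """Count tasks in `cluster` with desired status RUNNING and this startedBy value."""
    total = 0
    kwargs = {"cluster": cluster, "startedBy": started_by, "desiredStatus": "RUNNING"}
    while True:
        page = ecs.list_tasks(**kwargs)
        total += len(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            return total
        kwargs["nextToken"] = token  # follow pagination


def top_up_tasks(ecs, cluster, task_definition, started_by, desired):
    """Request enough new tasks via RunTask to get back to `desired`.

    Returns the number of tasks requested. Placement failures reported in the
    RunTask response are ignored in this sketch.
    """
    missing = max(0, desired - count_running_tasks(ecs, cluster, started_by))
    remaining = missing
    while remaining > 0:
        batch = min(remaining, 10)  # RunTask accepts at most 10 tasks per call
        ecs.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            count=batch,
            startedBy=started_by,
        )
        remaining -= batch
    return missing
```

With boto3 installed, `ecs` would be `boto3.client("ecs")`, and `top_up_tasks` could run on a schedule. It only replaces lost workers; it does not prevent the host termination that interrupts a long-running job.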

Additional context: We would not need our own scaling logic via RunTask if scale-in for services were controllable; see: https://github.com/aws/containers-roadmap/issues/125

coultn commented 3 years ago

Have you enabled managed termination protection? https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_AutoScalingGroupProvider.html

eriko-de commented 3 years ago

Yes, we have. This is our Terraform template (not sure if it helps):

resource "aws_ecs_capacity_provider" "capacity_provider" {
  name = "${var.prefix}-CapacityProvider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.resque_worker_auto_scaling_group.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      maximum_scaling_step_size = 4
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 70
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

And the web console also shows: Managed Instance Protection: Yes

... but tasks started via the RunTask action are still sometimes placed on instances that are not protected from scale-in.

eriko-de commented 3 years ago

Maybe this is the wrong place to discuss this, but scale-in protection works rather differently from what I would expect based on the documentation.

The instances that get started via the Auto Scaling group's launch template register themselves with the cluster, and it seems that only some of them get marked as protected from scale-in, or only after some time.

I would have expected the protected-from-scale-in flag to be set as soon as the first task gets started on an instance, and to be removed as soon as the last task on that host gets stopped.

Currently it feels like ECS removes the scale-in protection only when it tries to scale down the cluster.
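One way to check the expectation in this comment against a live cluster is to compare each container instance's task count with its Auto Scaling protection flag. The sketch below assumes the rule "protected if and only if at least one task runs on the instance", which is the commenter's expectation and not documented ECS behavior. The function names are hypothetical; `ecs` and `autoscaling` are expected to behave like the corresponding boto3 clients.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def task_counts_by_instance(ecs, cluster):
    """Map EC2 instance ID -> runningTasksCount for every container instance."""
    arns = []
    kwargs = {"cluster": cluster}
    while True:
        page = ecs.list_container_instances(**kwargs)
        arns.extend(page.get("containerInstanceArns", []))
        if not page.get("nextToken"):
            break
        kwargs["nextToken"] = page["nextToken"]
    counts = {}
    for batch in chunks(arns, 100):  # DescribeContainerInstances: up to 100 per call
        resp = ecs.describe_container_instances(cluster=cluster, containerInstances=batch)
        for ci in resp["containerInstances"]:
            counts[ci["ec2InstanceId"]] = ci["runningTasksCount"]
    return counts


def protection_by_instance(autoscaling, instance_ids):
    """Map EC2 instance ID -> ProtectedFromScaleIn for instances known to Auto Scaling."""
    flags = {}
    for batch in chunks(list(instance_ids), 50):  # DescribeAutoScalingInstances: up to 50 IDs
        resp = autoscaling.describe_auto_scaling_instances(InstanceIds=batch)
        for inst in resp["AutoScalingInstances"]:
            flags[inst["InstanceId"]] = inst["ProtectedFromScaleIn"]
    return flags


def mismatches(counts, flags):
    """Return (busy_but_unprotected, idle_but_protected) instance ID lists.

    Instances missing from `flags` are treated as unprotected.
    """
    busy_unprotected = sorted(i for i, n in counts.items() if n > 0 and not flags.get(i, False))
    idle_protected = sorted(i for i, n in counts.items() if n == 0 and flags.get(i, False))
    return busy_unprotected, idle_protected
```

Running this periodically and logging both lists would show whether busy instances really are left unprotected, and for how long idle instances keep their protection.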

mdomsch-seczetta commented 1 year ago

I'd prefer to have ECS manage the "protect from scale-in" flag on an EC2 instance. Until then, I added a call in my code's startup handler chain to set "protect from scale-in", and another in the shutdown handler chain to remove that protection, along with setting the corresponding ECS Container Instance state to DRAINING.
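The startup/shutdown approach described above might look roughly like this. It is a sketch under assumptions, not the commenter's actual code: it assumes a single long-running task per instance (otherwise one task's shutdown would remove protection that other tasks still need), and it assumes the task already knows its Auto Scaling group name, EC2 instance ID, cluster, and container instance ARN; how those are discovered is left out. The function names are hypothetical, and draining before unprotecting is a choice made here, not something stated in the thread. `autoscaling` and `ecs` are expected to behave like the corresponding boto3 clients.

```python
from contextlib import contextmanager


def protect_instance(autoscaling, asg_name, instance_id):
    """Startup handler: mark the host as protected from scale-in."""
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=True,
    )


def release_instance(autoscaling, ecs, asg_name, instance_id, cluster, container_instance_arn):
    """Shutdown handler: drain the container instance, then drop protection."""
    # Drain first so that no new task is placed on a host that is about
    # to become eligible for termination.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=False,
    )


@contextmanager
def scale_in_protection(autoscaling, ecs, asg_name, instance_id, cluster, container_instance_arn):
    """Hold scale-in protection for the duration of the wrapped work."""
    protect_instance(autoscaling, asg_name, instance_id)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        release_instance(autoscaling, ecs, asg_name, instance_id, cluster, container_instance_arn)
```

A worker would wrap its main loop in `with scale_in_protection(...):`. The task role needs permission for `autoscaling:SetInstanceProtection` and `ecs:UpdateContainerInstancesState` for these calls to succeed.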

aws / containers-roadmap

[ECS] [request/bug?]: Add scale in protection on hosts running a task via RunTask #1207
