Closed: dsouzajude closed this issue 4 years ago.
@dsouzajude Can maximumPercent in the ECS DeploymentConfiguration help ensure max concurrent tasks? You can set maximumPercent to 100 and combine it with a desired count of 1 on the ECS service; this ensures only 1 task is running at any given time. Beware that there could be a time gap between the new task and the previous task.
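For an ECS service, that configuration might look like the following fragment of a service definition (a sketch; with maximumPercent at 100, minimumHealthyPercent must be 0 so that ECS can stop the old task before starting a replacement during deployments):

```json
{
  "desiredCount": 1,
  "deploymentConfiguration": {
    "maximumPercent": 100,
    "minimumHealthyPercent": 0
  }
}
```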
Hi @wbingli, if I'm not mistaken, DeploymentConfiguration only works with ECS services, not standalone ECS tasks. Here I'm talking about Scheduled Tasks [1], in particular the section about "Running Tasks on a cron-like Schedule". I want to limit the number of concurrent scheduled tasks of the same task definition running at any single point in time.
[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html
This would be super valuable for use with scheduled ECS tasks. There's a lot of risk with the way it's implemented currently; my larger concern is tasks that run indefinitely, with more being scheduled every X duration, eventually filling cluster resources.
bump any movement here? :)
You can always implement some sort of locking mechanism to ensure that only one process is running. I have implemented a file-lock process where the application puts a "lock" file in an S3 bucket at the beginning of a run and deletes it at the end. If that file already exists, the task ends immediately. You could do this with S3 or DynamoDB to create that "lock" and ensure the task runs as a singleton instance.
Not the sexiest approach, but it works for our use case. You can even log whenever you encounter a lock and set an alarm on that message if it occurs over a given period of time.
@justin-wesley thanks for sharing your solution, just one question: what happens if two tasks check the bucket at the same time and both find nothing? Then both will put a lock and proceed, right?
@agouz we do this using DynamoDB because you can do an atomic action that will check for the lock and create it if it doesn't already exist. There are even libraries around for various languages like https://github.com/awslabs/dynamodb-lock-client
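A minimal sketch of that DynamoDB conditional-write pattern with boto3 (table, key, and attribute names here are illustrative, not from the thread):

```python
import time


def is_lock_conflict(error_code):
    # A failed condition check means another run already holds the lock.
    return error_code == "ConditionalCheckFailedException"


def acquire_lock(table_name, lock_id, ttl_seconds=3600):
    """Return True if we took the lock, False if another run holds it."""
    import boto3  # imported lazily; the pure helper above has no SDK dependency
    from botocore.exceptions import ClientError

    dynamodb = boto3.client("dynamodb")
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={
                "LockId": {"S": lock_id},
                # A TTL attribute lets DynamoDB expire stale locks left by
                # crashed runs, if TTL is enabled on the table.
                "ExpiresAt": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # The write succeeds only if no item with this key exists;
            # this check-and-create is atomic, unlike a plain S3 check.
            ConditionExpression="attribute_not_exists(LockId)",
        )
        return True
    except ClientError as err:
        if is_lock_conflict(err.response["Error"]["Code"]):
            return False
        raise
```

Release the lock by deleting the item at the end of the run (or rely on the TTL as a backstop).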
I need this feature now. It would make sense to add this feature as a Placement Constraint on the Task Group.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-constraints.html
My C# concurrency check that I run on startup:

```csharp
var tasksResponse = await ecsClient.ListTasksAsync(new ListTasksRequest
{
    Cluster = configuration["ClusterName"]
});

// The current task appears in its own listing, so more than one
// ARN means another instance is already running.
if (tasksResponse.TaskArns.Count > 1)
{
    Log.Error("Another instance already running, quitting...");
    return;
}
```
You also need to set up the app's IAM role with access to query the ECS data. Works well; the only flaw is that if two instances boot at the same moment, they'll both see each other and both quit. That's not a possibility in my workflow though.
> @justin-wesley thanks for sharing your solution, just one question what happens if two threads tried to check the bucket at the same time and they both found nothing, then both will put a lock and proceed, right?
I apologize, I didn't see your question until now. To continue processing, the task must not only validate that the file doesn't exist but also successfully write the file to the bucket. If two instances happen to try to write the file at the same time, one of them will fail, which is then understood as the process being locked.
@justin-wesley S3 is eventually consistent. I would not trust it to be atomic
> @justin-wesley S3 is eventually consistent. I would not trust it to be atomic

Great point for cases where a task can be triggered multiple ways and could possibly fire at the same time. Our process doesn't need to be atomic, as it is only triggered by a schedule; we just need to make sure the previous run has finished before starting a new one.
Can you use something like a service that can scale to 0?
I think it could work and avoid overlapping "tasks". Maybe the scaling event itself could be scheduled directly, to avoid the whole SQS setup.
(Whoops, realized that's probably what @wbingli meant.)
> Can you use something like a service that can scale to 0?
- Service with desired count at 0, and 1 max
- Scheduled event that creates a SQS message
- Scaling event when messages are visible on a queue, to add a service instance
- Service stops after completing the task, and deletes the SQS message
- Service is back to the desired 0 instances
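The steps above can be sketched with Application Auto Scaling (all names here are illustrative; CloudWatch alarms on the queue's ApproximateNumberOfMessagesVisible metric would be what actually triggers the policy):

```python
def ecs_service_resource_id(cluster, service):
    # Application Auto Scaling identifies an ECS service this way.
    return f"service/{cluster}/{service}"


def configure_scale_to_zero(cluster, service):
    """Cap the service at one task and let scaling move it between 0 and 1."""
    import boto3  # imported lazily so the helper above stays SDK-free

    aas = boto3.client("application-autoscaling")
    resource_id = ecs_service_resource_id(cluster, service)
    # Min 0 / max 1: the service idles at zero tasks and can never run two.
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=0,
        MaxCapacity=1,
    )
    # Scale out to exactly one task when the queue-depth alarm fires.
    aas.put_scaling_policy(
        PolicyName="scale-out-on-queue-messages",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ExactCapacity",
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}
            ],
        },
    )
```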
I agree that could "technically" work. The UX feels a bit convoluted, but it's definitely a technical workaround...
Would much prefer a native solution if possible though.
You can maintain maximum concurrency in ECS by using AWS Batch. Since Batch is built on ECS, it's easy to translate your jobs from a task definition to a job definition. Batch can restrict the number of vCPUs available in a compute environment (maxvCpus), so only that many jobs' worth of capacity can run at any time. See the AWS Batch documentation for details.
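Concurrency under this approach is indirect: the maxvCpus cap implies a job limit only if every job requests the same number of vCPUs. A sketch (environment name, subnets, security groups, and role are placeholders):

```python
def max_concurrent_jobs(max_vcpus, vcpus_per_job):
    """How many Batch jobs can run at once under a maxvCpus cap,
    assuming every job requests vcpus_per_job vCPUs."""
    return max_vcpus // vcpus_per_job


def create_capped_environment(name):
    import boto3  # imported lazily so the arithmetic above stays SDK-free

    batch = boto3.client("batch")
    # maxvCpus is the lever: Batch will not start work beyond this capacity,
    # so with maxvCpus=1 and 1-vCPU jobs, at most one job runs at a time.
    batch.create_compute_environment(
        computeEnvironmentName=name,
        type="MANAGED",
        computeResources={
            "type": "FARGATE",
            "maxvCpus": 1,
            "subnets": ["subnet-EXAMPLE"],
            "securityGroupIds": ["sg-EXAMPLE"],
        },
        serviceRole="arn:aws:iam::123456789012:role/BatchServiceRole",  # placeholder
    )
```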
Thanks @kendrexs. Closing this issue, since AWS Batch should be used for this.
This should not have been closed. AWS Batch is not a suitable solution to the problem described. AWS Batch is an entirely separate managed service. Whether it is capable of leveraging ECS compute under the hood is irrelevant. The problem described pertains specifically to the current implementation of cron scheduling for ECS tasks. Everything posted so far is a workaround at best.
Fundamentally the current implementation presents an insidiously dangerous risk for runaway tasks and eventual resource exhaustion on clusters.
A good example of where this can happen is scheduled tasks that function as canaries. If the canary task runs in awsvpc networking mode, unexpected overlapping of task executions can consume all available ENIs on the host and cause unrelated tasks to fail.
You could also end up in situations where tasks using larger Docker images get spun up when not required (perhaps due to an implementation error that keeps a task from timing out quickly), and you pay for a ton of cross-zone data transfer, e.g. USE1-USW2-AWS-Out-Bytes. Lots of people host modified Docker images in one centralized region to simplify management and then fan that image out to other active regions; this is a fairly standard approach.
Please re-open this discussion. @coultn
I completely agree with @Ray-B
The best feature of ECS (over Kubernetes) is that task definitions are defined once and can be run in multiple ways, which is incredibly useful for event-driven architecture. By not solving this problem, AWS is doing a great disservice to ECS adopters.
Interested to hear any experiences of using AWS Batch to achieve this, despite the objections. Also, see https://github.com/aws/containers-roadmap/issues/572.
I tried to migrate my scheduled task from ECS Fargate to Batch to gain concurrency limits and was blocked by the lack of FireLens support. Batch job definitions can only define one container, whereas task definitions can define multiple containers in a task. If we are going to be forced to use Batch, it would be very useful if we could use existing ECS task definitions instead of the incomplete abstraction of a Batch job definition.
I have found that when triggering an ECS task from a CloudWatch rule/cron, if the task is already running, it does not create another instance/container of the task, even though the ECS task start time is updated to the time the rule executed. This is on Fargate platform version 1.4.
Using the AWS SDK for whatever language your code is written in, you can see the number of tasks running and act on that in your code. If you see there are already n tasks, you can decide to exit from this one (meaning you shouldn't need a file lock or DynamoDB to keep track of this). The equivalent CLI command:

```shell
aws ecs list-tasks --cluster mycluster --profile myprofile --region us-east-2
```

If you have other things in your cluster, specify the task family: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
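That CLI call translates to a couple of lines with boto3; filtering by family keeps unrelated services in the cluster out of the count (cluster and family names here are illustrative):

```python
def other_instances_running(task_arns):
    # The calling task appears in its own listing, so more than one
    # ARN means another instance is already running.
    return max(len(task_arns) - 1, 0) > 0


def should_exit(cluster="mycluster", family="my-cron-task"):
    import boto3  # imported lazily so the counting logic stays SDK-free

    ecs = boto3.client("ecs")
    resp = ecs.list_tasks(cluster=cluster, family=family, desiredStatus="RUNNING")
    return other_instances_running(resp["taskArns"])
```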
When will this be resolved?
> I have found that when triggering an ECS task from a Cloudwatch rule/cron, if the task is already running, it does not create another instance/container of the task, even though the ECS task time started is updated to the time the rule executed. This is on Fargate Platform version 1.4
This is not the case for me
I have the same need. Is there still no solution other than heavy workarounds?
Why is this still closed?
Totally agree on this one. There should be a way to manage this limit.
Sharing a solution I created for this request using Step Functions: Run Amazon ECS Scheduled Tasks with Concurrency Limits. Though I agree that a native solution would be ideal, I see this approach as cost-effective and with minimal operational overhead.
Would love to get your thoughts and feedback.
Any updates on this?
We would appreciate it if you could reopen this issue and consider providing a native solution rather than relying on other services to achieve the same result.
Tell us about your request
AWS ECS supports Scheduled Tasks, allowing an ECS task to run either periodically or on a cron schedule. However, if an ECS scheduled task is already running and has not ended, another instance of the same task can start on the next run of the schedule.
The extended feature I'm proposing is a way to limit concurrency, i.e. to allow at most N concurrent executions of an ECS scheduled task. If N=1, there should be at most one instance of the ECS scheduled task running at any time, and on the next run the scheduler should not schedule another instance of the same task.
Furthermore, it would also be good to have a timeout option for task reliability: if a task has been running longer than the timeout (probably due to bugs such as being stuck in a loop), on the next run the scheduler should terminate the currently running task and schedule a new instance, still respecting the concurrency parameters.
Which service(s) is this request for?
ECS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We're trying to run some ECS tasks on a periodic basis and some on a cron schedule. Some of these tasks must have only one instance running at any point in time (for example, offline payments to be made, or emails to be sent out). Ideally, we would not like the service itself to incorporate any logic for concurrency and race conditions; we would like the service to be as dumb and simple as possible, doing only what it's supposed to do, and let the infrastructure ensure that only one instance (if N=1) of the task is running at any given time. The reason is that we have many cron-type jobs, and having every cron job incorporate such logic would make the system complex and difficult to maintain.
Are you currently working around this issue?
We are currently running our cron jobs in Aurora, which provides such functionality in place. We are in transition to ECS: we have moved about 50% of our services, excluding batch jobs, onto ECS, and this is where we needed to think about how to limit concurrency for cron jobs.
Some cron jobs that don't require any limits on concurrency are executed via AWS CloudWatch Event Rules, where the ECS task is the rule's target.
We currently don't have a workaround, but our idea is to have a Lambda function actually execute the ECS task via the ECS RunTask API. The Lambda would check for any currently running instance of the task and, if one exists, decide not to start a new one. This Lambda would be set as the target of the CloudWatch Events rule triggered on the defined schedule. But we'd prefer this functionality to be built into the service.
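Such a gatekeeper Lambda might look roughly like this (cluster name, task family, and network settings are placeholders; it assumes the scheduled rule targets the Lambda instead of the task directly):

```python
def should_start(running_task_arns):
    # Start a new run only if nothing from this family is still running.
    return len(running_task_arns) == 0


def handler(event, context):
    import boto3  # imported lazily so the decision logic stays SDK-free

    ecs = boto3.client("ecs")
    cluster, family = "my-cluster", "my-cron-task"
    running = ecs.list_tasks(
        cluster=cluster, family=family, desiredStatus="RUNNING"
    )["taskArns"]
    if not should_start(running):
        # A previous run is still going; skip this schedule tick.
        return {"started": False, "stillRunning": len(running)}
    ecs.run_task(
        cluster=cluster,
        taskDefinition=family,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": ["subnet-EXAMPLE"]}
        },
    )
    return {"started": True}
```

Note the same caveat as the in-task checks above: two Lambda invocations firing at the same instant could both see an empty list and both call RunTask.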
Additional context
More on our cron jobs: each is in the form of a Docker image wrapped by an ECS task definition, which is then executed by ECS.