aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

Support task execution timeout (maximum lifetime for a container) in ECS Fargate #572

Open nitheesha-amzn opened 5 years ago

nitheesha-amzn commented 5 years ago

Summary

ECS does not currently support a task execution timeout, i.e. a way to stop a task automatically once it has been running for longer than a set period, similar to AWS Batch job timeouts. The task definition has no parameter to enforce a task/container execution timeout that would automatically stop the container after the configured time.

Use-case example from a customer: I have an NLP model training job I want to run in a Fargate container triggered by a Lambda function. At some point, a bug might be introduced in the training code that causes it to run indefinitely. I don't want those tasks to accidentally pile up and end up with 50 tasks running for a couple of weeks before we notice, which would have cost implications. Is there a native way to kill a container if it hasn't exited on its own within a certain time?

Can this be considered a feature request?

danieladams456 commented 5 years ago

Thanks @nitheesha-amzn for submitting this for me! As we discussed in the ticket, a more native approach would be to have AWS Batch support the Fargate launch type. This seems to be somewhat of a force-fit edge case for ECS.

adnxn commented 5 years ago

Moving this over to the containers roadmap as an ECS feature request.

CpuID commented 4 years ago

I can see another use case here, as mentioned in https://github.com/aws/containers-roadmap/issues/232

apsoto commented 4 years ago

A problem I'm seeing is that a task expected to be relatively short-lived (a few hours at most, typically minutes) gets 'stuck' due to some bug and is still running after days.

It would be great to have a backstop that kills any job after X hours. When looking at the console with hundreds of tasks, it is hard to find the problem ones.

rcollette commented 4 years ago

I would like to stop a bastion host after a set period of time.

CpuID commented 4 years ago

@adnxn any updates re where this sits on the roadmap? :)

max-grosch commented 4 years ago

+1

adnxn commented 4 years ago

any updates re where this sits on the roadmap? :)

/ping @coultn

CraigHead commented 4 years ago

Bump! 🥓

vbarba commented 4 years ago

What do you think about adding an "essential" container to the task that runs sleep XX? When the sleep ends, ECS will stop the whole task.
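
For illustration, here is a minimal sketch of that idea as a boto3 call (the family, container names, image URIs, and Fargate sizing are placeholders, not part of the original suggestion; execution role, logging, etc. are omitted). Both containers are marked essential, so the task stops as soon as the sleeper exits:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical task definition: the real application container plus a
# second essential container that only sleeps. When the sleep elapses and
# that container exits, ECS stops the whole task because an essential
# container has stopped.
ecs.register_task_definition(
    family="my-app-with-timeout",                      # placeholder name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "my-app",                          # placeholder app container
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
            "essential": True,
        },
        {
            "name": "task-timeout",
            "image": "public.ecr.aws/docker/library/busybox:latest",
            "essential": True,
            "command": ["sh", "-c", "sleep 3600"],     # 1 hour maximum lifetime
        },
    ],
)
```

One caveat: the task is stopped when the sleep elapses whether or not the main container has finished its work, so the main process should handle SIGTERM if it needs to clean up.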

deyvsh commented 3 years ago

Interested to hear any experiences of using AWS Batch to achieve this, despite the objections. Also, see https://github.com/aws/containers-roadmap/issues/232.

sky4git commented 3 years ago

I wonder if this can be done through an AWS Config rule; an EventBridge cron rule would probably do the same. The idea is to run a Lambda function every hour that stops any container started before a certain time (e.g. more than an hour ago).

I have the same issue: I want to stop a container after an hour, but I'm not sure how to do it. I need this as part of several different stacks, so the cluster and task IDs will differ. It would be best if this were part of the task definition; otherwise I need to create a rule that targets all the different clusters/tasks.
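
As a rough sketch of that approach (the cluster name, maximum lifetime, and the hourly EventBridge schedule are all assumptions, not something ECS provides natively), a scheduled Lambda could look like this:

```python
import boto3
from datetime import datetime, timedelta, timezone

ecs = boto3.client("ecs")

MAX_AGE = timedelta(hours=1)          # assumed maximum task lifetime
CLUSTER = "my-cluster"                # hypothetical cluster name

def handler(event, context):
    """Stop any running task in CLUSTER that started more than MAX_AGE ago."""
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=CLUSTER, desiredStatus="RUNNING"):
        task_arns = page["taskArns"]
        if not task_arns:
            continue
        tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]
        for task in tasks:
            started_at = task.get("startedAt")
            if started_at and started_at < cutoff:
                ecs.stop_task(
                    cluster=CLUSTER,
                    task=task["taskArn"],
                    reason="Exceeded maximum task lifetime",
                )
```

An EventBridge rule would invoke this on an hourly schedule; you would still need to deploy it per account/region and scope it so it doesn't stop long-running service tasks you actually want to keep.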

gregorydickson commented 3 years ago

I wonder if this can be done through an AWS Config rule; an EventBridge cron rule would probably do the same. The idea is to run a Lambda function every hour that stops any container started before a certain time (e.g. more than an hour ago).

I have the same issue: I want to stop a container after an hour, but I'm not sure how to do it. I need this as part of several different stacks, so the cluster and task IDs will differ. It would be best if this were part of the task definition; otherwise I need to create a rule that targets all the different clusters/tasks.

@sky4git An AWS Step Functions state machine is one solution for your use case: it can 'monitor' a task and take action based on a time window. You can also create CloudWatch alarms to monitor failed executions and timeouts.

paolofulgoni commented 3 years ago

Here is another use case that would benefit from the requested feature: we run end-to-end tests in an ECS task on Fargate. If a bug causes a test to get stuck, the task could potentially run forever. I haven't found any way to set a CloudWatch alarm on task duration.

rektide commented 3 years ago

I'd love this feature.

Some of our tasks leak memory very slowly. It'd be great to be able to set a maximum task life of ~3 months, to keep the memory leakage small. In general, it seems like a modern best practice to reap your processes fairly early, to not allow very long-lived processes in your systems. It would be great if Fargate ECS could assist with this. We would also love it if regular ECS supported this.

gregorydickson commented 3 years ago

I'd love this feature.

Some of our tasks leak memory very slowly. It'd be great to be able to set a maximum task life of ~3 months, to keep the memory leakage small. In general, it seems like a modern best practice to reap your processes fairly early, to not allow very long-lived processes in your systems. It would be great if Fargate ECS could assist with this. We would also love it if regular ECS supported this.

@rektide I think you could use a Step Functions state machine to set a max time and shut down the ECS task.

citrusoft commented 3 years ago

+1

TarekAS commented 3 years ago

Until this is natively implemented in ECS Scheduled Tasks, here are some options for implementing timeouts:

  1. Wrap the command of your job container with timeout (assuming it's available in the container). e.g. timeout X mycommand arg1 arg2; STATUS=$?; if [ $STATUS -eq 124 ]; then echo 'Job Timed Out!'; fi; exit $STATUS
  2. Add an essential container to the task definition with command sleep X. When it times out, the whole task exits.
  3. Use external entities (such as Step Functions) to monitor and stop tasks that exceed a max lifetime (see the sketch after this list).
  4. Just add a CloudWatch alarm that notifies you when tasks have run for too long, and stop them manually.
  5. Use Kubernetes instead of ECS. Seriously, no native timeouts on scheduled tasks?
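
For option 3, here is a rough sketch (cluster, task definition, subnet, and role ARN are placeholders) of a Step Functions state machine that runs the task synchronously and fails the execution once a timeout elapses:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine: run the Fargate task via the .sync integration,
# which waits for the task to stop, and fail the state after one hour.
definition = {
    "StartAt": "RunTaskWithTimeout",
    "States": {
        "RunTaskWithTimeout": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "TimeoutSeconds": 3600,
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "my-cluster",
                "TaskDefinition": "my-task-def",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="run-task-with-timeout",
    roleArn="arn:aws:iam::123456789012:role/sfn-ecs-role",  # placeholder role
    definition=json.dumps(definition),
)
```

When the state times out, the execution fails with a States.Timeout error; depending on how strictly the task must be killed, you may want a Catch branch that explicitly calls ecs:stopTask as a cleanup step.
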
tim-x-y-z commented 2 years ago

This would be a great feature!

rahul799 commented 1 year ago

@TarekAS which metrics did you use to set the CloudWatch alarm?

ciurlaro42 commented 1 year ago

I don't understand how it is possible that such a basic feature is not available.

dims commented 1 year ago

cc @ofiliz

mreferre commented 1 year ago

This is a way to introduce a timeout for ECS tasks. Feedback welcome.

https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/

jeroenhabets commented 1 year ago

@mreferre thanks for sharing! Though home-grown workarounds are always possible and it's nice to see a cost effective one described in your blog, we, and I'm sure many others, will wait for ECS itself to support such timeouts before migrating our applicable workloads over to ECS. Again: thanks for sharing as I'm also confident it will help some others 🚀 !

citrusoft commented 1 year ago

Nice job! The article could be enhanced by pointing the developer to an article/tutorial on how the executable can catch the stop event/signal for a graceful termination.

mreferre commented 1 year ago

@mreferre thanks for sharing! Though home-grown workarounds are always possible and it's nice to see a cost effective one described in your blog, we, and I'm sure many others, will wait for ECS itself to support such timeouts before migrating our applicable workloads over to ECS. Again: thanks for sharing as I'm also confident it will help some others 🚀 !

Thanks!

mreferre commented 1 year ago

Nice job! The article could be enhanced by pointing the developer to an article/tutorial on how the executable can catch the stop event/signal for a graceful termination.

Thanks Tommy. Do you mean something like this? https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/
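
For illustration only (not an excerpt from the linked post), trapping SIGTERM in the container's main process might look like the sketch below; ECS sends SIGTERM on StopTask and follows up with SIGKILL once the container's stop timeout expires:

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM on StopTask; SIGKILL follows once the task's
    # stop timeout expires, so cleanup must finish within that window.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    while not shutting_down:
        # ... do a unit of work ...
        time.sleep(1)
    # Flush buffers, close connections, checkpoint state, etc.
    sys.exit(0)

if __name__ == "__main__":
    main()
```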

citrusoft commented 1 year ago

Yes sir, purr…fect!

larstobi commented 1 year ago

When running ECS services with many Fargate tasks per service, we want to be sure that new tasks are able to start successfully and stay healthy for a while, before terminating older Fargate tasks. So, just having tasks killed off after a certain time without checking that new tasks can start will cause downtime.

I think maybe tasks can be freshened up by using scheduled auto scaling events. So, scale up and wait a bit for the new tasks to be stable, and then scale down. Hopefully ECS will stop the older tasks first. Result: a new set of fresh tasks.
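
A rough sketch of that idea (the service and cluster names, capacities, and schedules are made up, and the service is assumed to already be registered as a scalable target for ecs:service:DesiredCount):

```python
import boto3

aas = boto3.client("application-autoscaling")

resource_id = "service/my-cluster/my-service"   # hypothetical ECS service

# Scale up, giving new tasks time to start and pass health checks...
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScalableDimension="ecs:service:DesiredCount",
    ResourceId=resource_id,
    ScheduledActionName="refresh-scale-up",
    Schedule="cron(0 3 * * ? *)",               # 03:00 UTC daily
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 8},
)

# ...then scale back down, at which point ECS terminates the excess tasks.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScalableDimension="ecs:service:DesiredCount",
    ResourceId=resource_id,
    ScheduledActionName="refresh-scale-down",
    Schedule="cron(30 3 * * ? *)",              # 30 minutes later
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 4},
)
```

Note that ECS does not guarantee the oldest tasks are stopped first on scale-in, so this only approximately refreshes the fleet.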

mreferre commented 1 year ago

When running ECS services with many Fargate tasks per service, we want to be sure that new tasks are able to start successfully and stay healthy for a while, before terminating older Fargate tasks. So, just having tasks killed off after a certain time without checking that new tasks can start will cause downtime.

I think maybe tasks can be freshened up by using scheduled auto scaling events. So, scale up and wait a bit for the new tasks to be stable, and then scale down. Hopefully ECS will stop the older tasks first. Result: a new set of fresh tasks.

@larstobi, that's (more or less) how ECS services work natively. When you create a service with n tasks in it, a re-deployment will make sure (with a certain number of knobs/configurations) that your service never goes down. Trying to orchestrate this with standalone RunTask API calls is possible but not easy (especially when there is a configuration that does this for you out of the box).

The timeout problem applies more to batch-type workloads, where you launch tasks knowing roughly how long they should take to complete and want to make sure they finish rather than hang indefinitely.

ghomem commented 1 year ago

I would like to support this feature request. It is a valid and necessary use case.

seinshah commented 10 months ago

I observed another odd behavior that suggests a higher-level timeout on ECS task runs is necessary.

This shows that if there were a way to enforce a timeout at the task-run or EventBridge-execution level, we would have avoided indefinitely running a task with an inactive task definition.

usn-devops commented 8 months ago

+1. We started running cron jobs on ECS, and some of them fail or time out for some reason and end up running indefinitely. We'd like to avoid this scenario with a max timeout.