Open nitheesha-amzn opened 5 years ago
Thanks @nitheesha-amzn for submitting this for me! As we discussed in the ticket, a more native approach would be to have AWS Batch support Fargate launch type. This seems to be kind of a force-fit edge case for ECS.
Moving this over to the containers roadmap as an ECS feature request.
I can see another use case here, as mentioned in https://github.com/aws/containers-roadmap/issues/232
A problem I'm seeing: a task that is expected to be relatively short lived (a few hours at most, but typically minutes) gets 'stuck' due to some bug and is still running after days.
It would be great to have a backstop that kills any jobs after X hours. With hundreds of tasks in the console, it is hard to find the problem ones.
Would like to stop a bastion host after a period of time.
@adnxn any updates re where this sits on the roadmap? :)
+1
any updates re where this sits on the roadmap? :)
/ping @coultn
Bump! 🥓
What do you think about adding an "essential" container to the task that runs `sleep X`? When the sleep ends, ECS will stop the task.
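Registering a task definition with such a watchdog sidecar might look roughly like this (a minimal sketch; the container names, images, account ID, and role ARN are placeholders):

```python
# Sketch of the "essential sleep sidecar" idea: a second essential container
# that exits after a fixed time, which makes ECS stop the whole task.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-task",                      # hypothetical family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",  # placeholder
            "essential": True,
        },
        {
            # Watchdog: when the sleep ends this essential container exits,
            # so ECS stops the remaining containers in the task.
            "name": "watchdog",
            "image": "public.ecr.aws/docker/library/busybox:latest",
            "essential": True,
            "command": ["sh", "-c", "sleep 3600"],
        },
    ],
)
```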
Interested to hear any experiences of using AWS Batch to achieve this, despite the objections. Also, see https://github.com/aws/containers-roadmap/issues/232.
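For what it's worth, AWS Batch already exposes a per-job timeout, so a sketch of that route looks roughly like this (the queue and job definition names are hypothetical):

```python
# AWS Batch terminates a job that runs longer than attemptDurationSeconds.
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="nightly-report",              # hypothetical
    jobQueue="my-job-queue",               # hypothetical
    jobDefinition="my-job-definition",     # hypothetical
    timeout={"attemptDurationSeconds": 3600},  # hard stop after one hour
)
```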
I wonder if this can be done through an AWS Config rule. An EventBridge cron rule would also do the same, I guess: run a Lambda function every hour to stop containers started before a certain time (i.e., more than an hour ago).
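Something like this, roughly (a minimal sketch; the cluster name and the one-hour cutoff are assumptions):

```python
# Lambda handler invoked by an hourly EventBridge rule: stops any RUNNING
# task in the cluster that started more than MAX_AGE ago.
from datetime import datetime, timedelta, timezone

import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"          # hypothetical cluster name
MAX_AGE = timedelta(hours=1)


def handler(event, context):
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    paginator = ecs.get_paginator("list_tasks")
    for page in paginator.paginate(cluster=CLUSTER, desiredStatus="RUNNING"):
        arns = page["taskArns"]
        if not arns:
            continue
        for task in ecs.describe_tasks(cluster=CLUSTER, tasks=arns)["tasks"]:
            started = task.get("startedAt")
            if started and started < cutoff:
                ecs.stop_task(
                    cluster=CLUSTER,
                    task=task["taskArn"],
                    reason="Exceeded maximum task lifetime",
                )
```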
I have the same issue: I want to stop a container after an hour, but I'm not sure how to do it. I need to do this as part of several different stacks, so the cluster and task IDs will be different. It would be best if it were part of the task definition; otherwise I need to create a Config rule to target all the different clusters/tasks.
@sky4git An AWS Step Functions state machine is one solution to your use case. It can 'monitor' the task and take action based on a time window. You can also create CloudWatch alarms to monitor failed executions and timeouts.
Here is another use case which would benefit from this requested feature: we run end-to-end tests on an ECS task with Fargate. If, due to a bug, a test gets stuck, the task could potentially run forever. I haven't found any way to set a CloudWatch alarm for task duration.
I'd love this feature.
Some of our tasks leak memory very slowly. It'd be great to be able to set a maximum task life of ~3 months, to keep the memory leakage small. In general, it seems like a modern best practice to reap your processes fairly early, to not allow very long-lived processes in your systems. It would be great if Fargate ECS could assist with this. We would also love it if regular ECS supported this.
@rektide I think that you could use a Step Function State Machine to set a max time and shut down the ECS task.
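A rough sketch of that idea, assuming the AWS SDK service integration is used to call ecs:StopTask after a Wait state (the role ARN and the input fields Cluster/TaskArn/TimeoutSeconds are assumptions):

```python
# Creates a state machine that is started alongside the task, waits for the
# allowed lifetime, then stops the task via the ECS StopTask API.
import json

import boto3

definition = {
    "StartAt": "WaitForTimeout",
    "States": {
        "WaitForTimeout": {
            "Type": "Wait",
            "SecondsPath": "$.TimeoutSeconds",   # lifetime passed in the execution input
            "Next": "StopTask",
        },
        "StopTask": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ecs:stopTask",
            "Parameters": {
                "Cluster.$": "$.Cluster",
                "Task.$": "$.TaskArn",
                "Reason": "Maximum task lifetime exceeded",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="ecs-task-timeout",
    roleArn="arn:aws:iam::123456789012:role/sfn-ecs-stop",  # placeholder role
    definition=json.dumps(definition),
)
```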
+1
Until this is natively implemented in ECS Scheduled Tasks, here are some options you have to implement timeouts:

- Wrap your command with `timeout` (assuming it's available in the container), e.g. `timeout X mycommand arg1 arg2; STATUS=$?; if [ $STATUS -eq 124 ]; then echo 'Job Timed Out!'; fi; exit $STATUS`
- Add an essential sidecar container that just runs `sleep X`. When it times out, the whole task exits.

This would be a great feature!
@TarekAS which metrics did you use to set the CloudWatch alarm?
I don't understand how it is possible that such a basic feature is not available.
cc @ofiliz
This is a way to introduce a timeout for ECS tasks. Feedback welcome.
https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/
@mreferre thanks for sharing! Though home-grown workarounds are always possible and it's nice to see a cost effective one described in your blog, we, and I'm sure many others, will wait for ECS itself to support such timeouts before migrating our applicable workloads over to ECS. Again: thanks for sharing as I'm also confident it will help some others 🚀 !
Nice job. The article could be enhanced to point the developer to an article/tutorial teaching how the executable could catch the event/signal for a graceful termination.
Thanks!
Thanks Tommy. Do you mean something like this?
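For illustration, a minimal sketch of catching the stop signal (SIGTERM) that ECS sends before force-killing a container:

```python
# ECS sends SIGTERM on StopTask and, after the stop timeout, SIGKILL.
# Trapping SIGTERM lets the process finish in-flight work and exit cleanly.
import signal
import sys
import time

shutting_down = False


def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, on_sigterm)

while not shutting_down:
    time.sleep(1)  # placeholder for the real work loop

print("Received SIGTERM, exiting cleanly")
sys.exit(0)
```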
Yes sir, purr…fect!
When running ECS services with many Fargate tasks per service, we want to be sure that new tasks are able to start successfully and stay healthy for a while, before terminating older Fargate tasks. So, just having tasks killed off after a certain time without checking that new tasks can start will cause downtime.
I think maybe tasks can be freshened up by using scheduled auto scaling events. So, scale up and wait a bit for the new tasks to be stable, and then scale down. Hopefully ECS will stop the older tasks first. Result: a new set of fresh tasks.
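A rough sketch of that with Application Auto Scaling scheduled actions (cluster/service names, schedules, and counts are assumptions; the service must already be registered as a scalable target):

```python
# Scale the service up, then back down shortly after, so ECS replaces
# older tasks with a fresh set.
import boto3

aas = boto3.client("application-autoscaling")
RESOURCE = "service/my-cluster/my-service"  # hypothetical

# Scale up at 03:00 UTC...
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-up",
    ResourceId=RESOURCE,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 8},
)

# ...and back down at 03:30 UTC, once the new tasks have proven stable.
aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="refresh-scale-down",
    ResourceId=RESOURCE,
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(30 3 * * ? *)",
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 4},
)
```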
@larstobi, that's (more or less) how ECS services work natively. When you create a service with `n` tasks in it, a re-deployment will make sure (with a certain amount of knobs/configurations) that your service never goes down. Trying to orchestrate this with standalone RunTask API calls is possible but not easy (especially when there is a configuration that does this for you out of the box).
The timeout problem is more relevant for batch-type workloads, where you launch a task knowing it should complete within a certain amount of time, and you want to make sure it does complete instead of remaining pending indefinitely.
I would like to support this feature request. It is a valid and necessary use case.
I observed another weird behavior that implies a higher-level timeout on ECS task runs is necessary. A task kept running with an `inactive` label next to the task definition column. It was being triggered by EventBridge, and it spawned multiple tasks until the rule was eventually deleted as well. This shows that if there were a way to enforce a timeout at the task run or EventBridge execution level, we would have avoided running tasks with an inactive task definition indefinitely.
+1. We started running cron jobs in ECS, and some of them fail or hang for some reason and end up running indefinitely. We'd like to avoid this scenario with a max timeout.
Summary
ECS does not currently support a task execution timeout: when a task runs longer than a certain period of time, it should be stopped automatically, the way AWS Batch job timeouts work. The task definition does not have a parameter to enforce a task/container execution timeout that automatically stops the container after the set time.
Use-case example from a customer: I have an NLP model training job I want to run in a Fargate container, triggered by a Lambda function. At some point, a bug might be introduced in the training code that causes it to run indefinitely. I don't want those tasks accidentally piling up and ending up with 50 tasks running for a couple of weeks before we notice; that could have a cost implication. Is there a native way to kill a container if it hasn't exited on its own before a certain time?
Can this be considered as a feature request?