hwatts opened this issue 5 years ago
Hi, any updates on this? We're running into the same issue, where rogue Fargate tasks that have gone wrong somehow end up running forever.
We've tried including timeouts in the task code itself so it exits, but for some reason this doesn't work and the tasks keep running.
Would also love to see this implemented!
Also running into a similar issue: the task doesn't stop even when the underlying process completes. It only happens occasionally (< 1% of executions), but it's still important not to have to manually check for rogue tasks every time, or to build yet another automation to stop them.
To add to the scenario: "On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task." In our case it can also block one of our queues (as one of the handful of workers is blocked indefinitely).
Facing the same problem here.
I'm running into the same problem. Step Functions can submit ECS tasks but doesn't clean them up (even if a timeout is specified). I have to set up a relatively elaborate catch-and-cleanup in Step Functions to stop jobs that hang indefinitely and would otherwise block further processing. It would be so much easier if we could just specify a stop-after-x-seconds value in ECS.
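A minimal sketch of such a cleanup step, assuming a Lambda-style handler invoked from the Catch branch that receives the cluster and task ARNs in its input (the field names here are hypothetical, not from any real integration):

```python
import boto3

ecs = boto3.client("ecs")

def cleanup_handler(event, context):
    """Stop a lingering ECS task after a Step Functions timeout/catch fires.

    Assumes the Catch branch forwards the cluster and task ARN in the
    state input; both field names are placeholders.
    """
    ecs.stop_task(
        cluster=event["clusterArn"],
        task=event["taskArn"],
        reason="Stopped by Step Functions timeout cleanup",
    )
```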
me, as well. I want this, plz
I am wrapping this up in a short blog post to add more context, but I built a Step Functions workflow that essentially kicks off, checks if there is a `TIMEOUT` tag associated with the task, and if there is, waits `n` seconds before sending a `stopTask` (where `n` is the value of the `TIMEOUT` tag).
This is the CFN template that includes everything (SF workflow, EB rules, IAM roles, etc.). There is nothing else to do: once the stack is deployed as-is, all tasks launched in the account/region with a `TIMEOUT` tag will be stopped after the value specified (in seconds).
```yaml
Resources:
  ecstaskrunning:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - RUNNING
          desiredStatus:
            - RUNNING
      Targets:
        - Id: !GetAtt tasktimeoutstatemachine.Name
          Arn: !Ref tasktimeoutstatemachine
          RoleArn: !GetAtt ecstaskrunningTotasktimeoutstatemachine.Arn
  tasktimeoutstatemachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        Comment: State machine to stop ECS tasks after a TIMEOUT-tag-defined delay
        StartAt: ListTagsForResource
        States:
          ListTagsForResource:
            Type: Task
            Next: CheckTimeout
            Parameters:
              ResourceArn.$: $.resources[0]
            ResultPath: $.listTagsForResource
            Resource: arn:aws:states:::aws-sdk:ecs:listTagsForResource
          CheckTimeout:
            Type: Pass
            Parameters:
              timeoutexists.$: States.ArrayLength($.listTagsForResource.Tags[?(@.Key == TIMEOUT)])
            ResultPath: $.timeoutconfiguration
            Next: IsTimeoutSet
          IsTimeoutSet:
            Type: Choice
            Choices:
              - Variable: $.timeoutconfiguration.timeoutexists
                NumericEquals: 1
                Next: GetTimeoutValue
            Default: Success
          GetTimeoutValue:
            Type: Pass
            Parameters:
              timeoutvalue.$: States.ArrayGetItem($.listTagsForResource.Tags[?(@.Key == TIMEOUT)].Value, 0)
            ResultPath: $.timeoutconfiguration
            Next: Wait
          Success:
            Type: Succeed
          Wait:
            Type: Wait
            Next: StopTask
            SecondsPath: $.timeoutconfiguration.timeoutvalue
          StopTask:
            Type: Task
            Parameters:
              Task.$: $.resources[0]
              Cluster.$: $.detail.clusterArn
            Resource: arn:aws:states:::aws-sdk:ecs:stopTask
            End: true
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt tasktimeoutstatemachineLogGroup.Arn
      Policies:
        - AWSXrayWriteOnlyAccess
        - Statement:
            - Effect: Allow
              Action:
                - ecs:ListTagsForResource
                - ecs:StopTask
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: '*'
      Tracing:
        Enabled: true
      Type: STANDARD
  tasktimeoutstatemachineLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub
        - /aws/vendedlogs/states/${AWS::StackName}-${ResourceId}-Logs
        - ResourceId: tasktimeoutstatemachine
  ecstaskrunningTotasktimeoutstatemachine:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: !Sub events.${AWS::URLSuffix}
          Action: sts:AssumeRole
          Condition:
            ArnLike:
              aws:SourceArn: !Sub
                - arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${AWS::StackName}-${ResourceId}-*
                - ResourceId: ecstaskrunning
      Policies:
        - PolicyName: StartExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref tasktimeoutstatemachine
Transform: AWS::Serverless-2016-10-31
```
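A task just needs to carry the `TIMEOUT` tag at launch for the workflow to pick it up. As a minimal sketch (the cluster, task definition, and subnet names are placeholders, not part of the template above), with boto3 that could look like:

```python
import boto3

ecs = boto3.client("ecs")

# Launch a Fargate task that the state machine above will stop
# after 600 seconds; cluster/task-definition/subnet are placeholders.
ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-batch-job",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    tags=[{"key": "TIMEOUT", "value": "600"}],
)
```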
I hear you that the ideal solution would be native support for this capability in ECS, but I am curious whether an approach like this would work. Besides having to pay extra for the Step Functions executions (I hear you, again), what are the other reasons why this approach would not work vs. a timeout flag in `RunTask`?
This is the blog post that gets into more context: https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/
This looks similar to #572
@mreferre regarding your question

> what are the other reasons why this approach would not work vs. a timeout flag in `RunTask`?

Many turn to AWS and services like ECS to handle most of their hosting complexities, in order to be able to focus on where they can deliver the most value. So "it" may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (over a timeout flag), not only in implementation but also in maintenance and support.
With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.
@jeroenhabets fair enough. Thanks!
I would like to suggest another method: using a sidecar container, all native inside ECS. Add a small essential container to your task definition that runs a sleep command and exits after a defined amount of time.
```json
{
  "family": "lifespan",
  "networkMode": "awsvpc",
  "requiresCompatibilities": [
    "EC2",
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "nginx",
      "image": "public.ecr.aws/nginx/nginx:mainline",
      "essential": true
    },
    {
      "name": "lifespan",
      "image": "public.ecr.aws/docker/library/busybox:stable",
      "essential": true,
      "command": [
        "sh",
        "-c",
        "sleep $TIMEOUT"
      ],
      "environment": [
        {
          "name": "TIMEOUT",
          "value": "60"
        }
      ]
    }
  ]
}
```
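Because the sidecar is marked `essential`, the whole task stops when it exits. If you want a different lifespan per run without re-registering the task definition, a minimal sketch (cluster and subnet names are placeholders) is to override the sidecar's `TIMEOUT` environment variable at launch:

```python
import boto3

ecs = boto3.client("ecs")

# Override the lifespan sidecar's TIMEOUT for this run only.
ecs.run_task(
    cluster="my-cluster",
    taskDefinition="lifespan",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
    },
    overrides={
        "containerOverrides": [
            {
                "name": "lifespan",
                "environment": [{"name": "TIMEOUT", "value": "120"}],
            }
        ]
    },
)
```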
For a more detailed explanation, I wrote this up.
It also helps with #572.
Would be interested in your feedback.
Please implement this
@maishsk same feedback for your workaround:

> Many turn to AWS and services like ECS to handle most of their hosting complexities, in order to be able to focus on where they can deliver the most value. So "it" may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (over a timeout flag), not only in implementation but also in maintenance and support.
> With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.
A potential workaround is to enforce the timeout in your own application's `ENTRYPOINT` or `CMD`.
For example, I am using the `timeout` command with an environment variable that I set like this:

```sh
timeout ${TIMEOUT} python my_script.py
```
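If you would rather enforce the deadline inside the Python script itself, a minimal sketch using `signal.alarm` (Unix-only; the sleep stands in for the real batch work):

```python
import os
import signal
import sys
import time

# Read the deadline from the same TIMEOUT variable (defaulting to 60s).
deadline = int(os.environ.get("TIMEOUT", "60"))

def on_timeout(signum, frame):
    print(f"Timed out after {deadline}s, exiting", file=sys.stderr)
    sys.exit(124)  # the exit code the `timeout` command uses

signal.signal(signal.SIGALRM, on_timeout)
signal.alarm(deadline)  # deliver SIGALRM after `deadline` seconds

# ... the actual batch work goes here; simulated with a long sleep:
time.sleep(3600)
```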
This feature would definitely be quite helpful; trying to do this today is a challenge. BTW, in our use case we had to run tasks at different times, so services are not an option.
This is similar to `activeDeadlineSeconds` in Kubernetes:

> Another way to terminate a Job is by setting an active deadline. Do this by setting the `.spec.activeDeadlineSeconds` field of the Job to a number of seconds. The `activeDeadlineSeconds` applies to the duration of the job, no matter how many Pods are created. Once a Job reaches `activeDeadlineSeconds`, all of its running Pods are terminated and the Job status will become `type: Failed` with `reason: DeadlineExceeded`.

https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
**Tell us about your request**
An optional timeout parameter for the RunTask API.

**Which service(s) is this request for?**
Fargate, ECS

**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
As well as services, which are expected to be always in a running state, we also run scheduled tasks in ECS that are expected to complete various batch processes, then exit. On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task. An optional timeout parameter that's enforced by the ECS scheduler would help to manage these.

**Are you currently working around this issue?**
Only by manually calling the StopTask API when we spot long-running tasks.
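That manual StopTask sweep can also be scripted; a minimal sketch, assuming a single cluster (name is a placeholder) and an illustrative one-hour cutoff:

```python
from datetime import datetime, timedelta, timezone

import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # placeholder
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)  # illustrative limit

# List running tasks and stop any that started before the cutoff.
task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]:
        if task.get("startedAt") and task["startedAt"] < cutoff:
            ecs.stop_task(cluster=cluster, task=task["taskArn"],
                          reason="Exceeded 1h runtime limit")
```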