aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Timeout for RunTask #291

Open hwatts opened 5 years ago

hwatts commented 5 years ago

Tell us about your request: An optional timeout parameter for the RunTask API.

Which service(s) is this request for? Fargate, ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? As well as services, which are expected to always be in a running state, we also run scheduled tasks in ECS that are expected to complete various batch processes and then exit. On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task. An optional timeout parameter that's enforced by the ECS scheduler would help us manage these.

Are you currently working around this issue? Only by manually calling the StopTask API when we spot long-running tasks.
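
A minimal sketch of that manual workaround with boto3; the cluster name and runtime cutoff below are hypothetical, and the script simply stops any task that has been running longer than the cutoff:

from datetime import datetime, timezone

import boto3

ecs = boto3.client("ecs")
CLUSTER = "batch-cluster"  # hypothetical cluster name
MAX_AGE_SECONDS = 3600     # hypothetical runtime cutoff

# Page through the cluster's running tasks and stop any that exceed the cutoff.
paginator = ecs.get_paginator("list_tasks")
for page in paginator.paginate(cluster=CLUSTER, desiredStatus="RUNNING"):
    if not page["taskArns"]:
        continue
    tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=page["taskArns"])["tasks"]
    for task in tasks:
        started = task.get("startedAt")  # absent while a task is still provisioning
        if started is None:
            continue
        age = (datetime.now(timezone.utc) - started).total_seconds()
        if age > MAX_AGE_SECONDS:
            ecs.stop_task(cluster=CLUSTER, task=task["taskArn"],
                          reason=f"Exceeded {MAX_AGE_SECONDS}s runtime cutoff")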

opqpop commented 3 years ago

Hi, any updates on this? We're running into the same issue, where rogue Fargate tasks that have gone wrong somehow end up running forever.

We've tried including timeouts in the application code itself, but for some reason this doesn't work and the tasks just keep running.

Hc747 commented 3 years ago

Would also love to see this implemented!

palharsh commented 2 years ago

Also running into a similar issue: the task doesn't stop even when the underlying process completes. It only happens occasionally (< 1% of executions), but it's still important not to have to manually check for rogue tasks lying around, or to build yet another automation to stop them.

jeroenhabets commented 2 years ago

To add to the scenario "On rare occasions, these jobs can hang or take an excessive amount of time to complete, incurring cost and potentially impacting future schedules of the task.": in our case it can also block one of our queues (as one of the handful of workers is blocked indefinitely).

otavioribeiromedeiros commented 1 year ago

Facing the same problem here.

mikedorfman commented 1 year ago

I'm running into the same problem. Step Functions can submit ECS tasks but doesn't clean them up (even if a timeout is specified). I have to set up a relatively elaborate catch-and-cleanup in Step Functions to deal with jobs that hang indefinitely and would block further processing. It would be so much easier if we could just specify a stop-after-x-seconds value in ECS.
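
That cleanup step essentially boils down to a StopTask call. A minimal sketch of a Lambda handler used as the Catch target, assuming the state machine forwards the cluster and task ARN in its input (the field names here are hypothetical):

import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # "cluster" and "taskArn" are assumed to be forwarded by the state
    # machine's Catch input; adapt the field names to your workflow.
    ecs.stop_task(
        cluster=event["cluster"],
        task=event["taskArn"],
        reason="Stopped by Step Functions timeout cleanup",
    )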

ewascent commented 1 year ago

Me as well. I want this, please.

mreferre commented 1 year ago

I am wrapping this up in a short blog post to add more context, but I built a Step Functions workflow that kicks off when a task starts, checks whether a TIMEOUT tag is associated with the task, and, if there is one, waits n seconds before sending a StopTask (where n is the value of the TIMEOUT tag).

This is the CFN template that includes everything (the SF workflow, EventBridge rules, IAM roles, etc.). There is nothing else to do: when the stack is deployed as-is, all tasks launched in the account/region with a TIMEOUT tag will be stopped after the specified value (in seconds).

Resources:
  ecstaskrunning:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - RUNNING
          desiredStatus:
            - RUNNING
      Targets:
        - Id: !GetAtt tasktimeoutstatemachine.Name
          Arn: !Ref tasktimeoutstatemachine
          RoleArn: !GetAtt ecstaskrunningTotasktimeoutstatemachine.Arn
  tasktimeoutstatemachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        Comment: State machine to stop ECS tasks after a tag-defined timeout
        StartAt: ListTagsForResource
        States:
          ListTagsForResource:
            Type: Task
            Next: CheckTimeout
            Parameters:
              ResourceArn.$: $.resources[0]
            ResultPath: $.listTagsForResource
            Resource: arn:aws:states:::aws-sdk:ecs:listTagsForResource
          CheckTimeout:
            Type: Pass
            Parameters:
              timeoutexists.$: States.ArrayLength($.listTagsForResource.Tags[?(@.Key == TIMEOUT)])
            ResultPath: $.timeoutconfiguration
            Next: IsTimeoutSet
          IsTimeoutSet:
            Type: Choice
            Choices:
              - Variable: $.timeoutconfiguration.timeoutexists
                NumericEquals: 1
                Next: GetTimeoutValue
            Default: Success
          GetTimeoutValue:
            Type: Pass
            Parameters:
              timeoutvalue.$: States.ArrayGetItem($.listTagsForResource.Tags[?(@.Key == TIMEOUT)].Value, 0)
            ResultPath: $.timeoutconfiguration
            Next: Wait
          Success:
            Type: Succeed
          Wait:
            Type: Wait
            Next: StopTask
            SecondsPath: $.timeoutconfiguration.timeoutvalue
          StopTask:
            Type: Task
            Parameters:
              Task.$: $.resources[0]
              Cluster.$: $.detail.clusterArn
            Resource: arn:aws:states:::aws-sdk:ecs:stopTask
            End: true
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt tasktimeoutstatemachineLogGroup.Arn
      Policies:
        - AWSXrayWriteOnlyAccess
        - Statement:
            - Effect: Allow
              Action:
                - ecs:ListTagsForResource
                - ecs:StopTask
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: '*'
      Tracing:
        Enabled: true
      Type: STANDARD
  tasktimeoutstatemachineLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub
        - /aws/vendedlogs/states/${AWS::StackName}-${ResourceId}-Logs
        - ResourceId: tasktimeoutstatemachine
  ecstaskrunningTotasktimeoutstatemachine:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: !Sub events.${AWS::URLSuffix}
          Action: sts:AssumeRole
          Condition:
            ArnLike:
              aws:SourceArn: !Sub
                - arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${AWS::StackName}-${ResourceId}-*
                - ResourceId: ecstaskrunning
      Policies:
        - PolicyName: StartExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref tasktimeoutstatemachine
Transform: AWS::Serverless-2016-10-31
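
For reference, a task opts in to this workflow simply by being launched with a TIMEOUT tag. A minimal boto3 sketch; the cluster, task definition, and subnet below are placeholders:

import boto3

ecs = boto3.client("ecs")

# Launch a task carrying a TIMEOUT tag (value in seconds); the EventBridge
# rule above picks it up once the task reaches RUNNING.
ecs.run_task(
    cluster="batch-cluster",
    taskDefinition="my-batch-task",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
    },
    tags=[{"key": "TIMEOUT", "value": "300"}],
)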

I hear you that the ideal solution would be native support for this capability in ECS, but I am curious whether an approach like this would work. Besides having to pay extra for Step Functions (I hear you, again), what are other reasons this approach would not work versus a timeout flag on RunTask?

mreferre commented 1 year ago

This is the blog post that provides more context: https://it20.info/2023/03/configuring-a-timeout-for-amazon-ecs-tasks/

maishsk commented 1 year ago

This looks similar to #572

jeroenhabets commented 1 year ago

@mreferre, regarding your question:

what are other reasons this approach would not work versus a timeout flag on RunTask?

Many turn to AWS and services like ECS to handle most of their hosting complexities, in order to focus on where they can deliver the most value. So it may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

mreferre commented 1 year ago

@jeroenhabets fair enough. Thanks!

maishsk commented 1 year ago

@mreferre, regarding your question:

what are other reasons this approach would not work versus a timeout flag on RunTask?

Many turn to AWS and services like ECS to handle most of their hosting complexities, in order to focus on where they can deliver the most value. So it may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

I would like to suggest another method: a sidecar container, all native inside ECS. Add a small essential container to your task definition that runs a sleep command and exits after a defined amount of time. Because the sidecar is marked essential, its exit causes ECS to stop the entire task, as in the task definition below.

{
  "family": "lifespan",
  "networkMode": "awsvpc",
  "requiresCompatibilities": [
    "EC2",
    "FARGATE"
  ],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "nginx",
      "image": "public.ecr.aws/nginx/nginx:mainline",
      "essential": true
    },
    {
      "name": "lifespan",
      "image": "public.ecr.aws/docker/library/busybox:stable",
      "essential": true,
      "command": [
        "sh",
        "-c",
        "sleep $TIMEOUT"
      ],
      "environment": [
        {
          "name": "TIMEOUT",
          "value": "60"
        }
      ]
    }
  ]
}
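
Since the lifespan is just an environment variable on the sidecar, it can also be set per invocation. A minimal boto3 sketch, assuming the task definition above; the cluster and subnet are placeholders:

import boto3

ecs = boto3.client("ecs")

# Run the "lifespan" task definition with a per-run timeout of 300 seconds by
# overriding the sidecar's TIMEOUT environment variable.
ecs.run_task(
    cluster="batch-cluster",
    taskDefinition="lifespan",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
    },
    overrides={
        "containerOverrides": [
            {
                "name": "lifespan",
                "environment": [{"name": "TIMEOUT", "value": "300"}],
            }
        ]
    },
)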

For a more detailed explanation, I wrote this up.

Also helps with #572

Would be interested in your feedback.

calebplum commented 1 year ago

Please implement this

jeroenhabets commented 1 year ago

@maishsk, same feedback for your workaround:

Many turn to AWS and services like ECS to handle most of their hosting complexities, in order to focus on where they can deliver the most value. So it may technically work (i.e. the Step Functions approach 🚀), but it introduces needless complexity (compared to a timeout flag), not only in implementation but also in maintenance and support.

With almost 250 👍 in total, let's hope the ECS team can deliver this (sub)feature sometime soon.

plurch commented 10 months ago

A potential workaround is to enforce the timeout in your own application code, via the container's ENTRYPOINT or CMD.

For example, I am using the timeout command with an environment variable that I set like this:

timeout ${TIMEOUT} python my_script.py
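
(GNU timeout exits with status 124 when the deadline fires.) The same idea can also live inside the application itself; a minimal in-process sketch for a Python entrypoint, assuming the same TIMEOUT environment variable:

import os
import signal
import sys

def main():
    ...  # the actual batch work goes here

def _deadline(signum, frame):
    sys.exit(124)  # mirror GNU timeout's exit status on expiry

if __name__ == "__main__":
    signal.signal(signal.SIGALRM, _deadline)
    # TIMEOUT is the same environment variable as in the shell example above.
    signal.alarm(int(os.environ.get("TIMEOUT", "3600")))
    main()
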
ldipotetjob commented 7 months ago

This feature would definitely be quite helpful; achieving this today is challenging. BTW, in our use case we had to run tasks at different times, so services are not an option.

trallnag commented 1 week ago

This is similar to activeDeadlineSeconds in Kubernetes.

Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.

https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup