aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[Batch] [ECS] [request]: Allow ECS Anywhere clusters to be used with Batch #1557

Open okofish opened 3 years ago

okofish commented 3 years ago

Tell us about your request
What do you want us to build?
I'd like to be able to use ECS Anywhere clusters in unmanaged Batch compute environments.

Which service(s) is this request for?
AWS Batch and ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Coupling the AWS Batch control plane with on-premises ECS Anywhere instances is a very intriguing model for hybrid cloud and dev/test batch processing workloads. It is currently possible to create an unmanaged compute environment linked to an ECS cluster with external instances, but there's no way to tell the Batch control plane to run tasks on the external instances. It can be seen from CloudTrail logs that Batch invokes the RunTask operation using the setting "launchType": "EC2":

{
    "eventName": "RunTask",
    "sourceIPAddress": "batch.amazonaws.com",
    "userAgent": "batch.amazonaws.com",
    "requestParameters": {
        "launchType": "EC2"   // <-------
    },
    "responseElements": {
        "tasks": [],
        "failures": [
            {
                "arn": "arn:aws:ecs:us-east-1:123456789012:container-instance/3e0ef3abc8b54a5b94fff48ece354d60",
                "reason": "AGENT"
            }
        ]
    },
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management"
}

My understanding is that this needs to be "launchType": "EXTERNAL" in order for the task to run on ECS Anywhere instances. It would be desirable to be able to configure Batch compute environments to use the EXTERNAL launch type.
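
For anyone who wants to check their own CloudTrail records for the same pattern, here is a minimal Python sketch. It only parses a record shaped like the one quoted above and reports the launch type Batch requested along with any failure reasons; it makes no AWS calls. The function name `summarize_batch_run_task` is hypothetical, and the embedded record is a trimmed copy of the event above with the annotation comment removed so that it parses as JSON.

```python
import json

# Trimmed copy of the CloudTrail record quoted above. The "// <---" annotation
# is removed because JSON does not allow comments.
EVENT_JSON = """
{
    "eventName": "RunTask",
    "sourceIPAddress": "batch.amazonaws.com",
    "requestParameters": {"launchType": "EC2"},
    "responseElements": {
        "tasks": [],
        "failures": [
            {
                "arn": "arn:aws:ecs:us-east-1:123456789012:container-instance/3e0ef3abc8b54a5b94fff48ece354d60",
                "reason": "AGENT"
            }
        ]
    }
}
"""


def summarize_batch_run_task(event):
    """Return (launch_type, failure_reasons) for a RunTask call made by Batch.

    Returns None for any other event. This is a hypothetical helper, not part
    of any AWS SDK.
    """
    if event.get("eventName") != "RunTask":
        return None
    if event.get("sourceIPAddress") != "batch.amazonaws.com":
        return None
    launch_type = (event.get("requestParameters") or {}).get("launchType")
    failures = (event.get("responseElements") or {}).get("failures") or []
    return launch_type, [failure.get("reason") for failure in failures]


summary = summarize_batch_run_task(json.loads(EVENT_JSON))
print(summary)  # ('EC2', ['AGENT'])
```

A result of `'EC2'` paired with an `AGENT` failure on an external container instance matches the behavior described in this request.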

Are you currently working around this issue?
I do not currently have a workaround for this issue.

Additional context
N/A

Attachments
N/A

rrizun commented 2 years ago

indeed.. would be nice to be able to specify --platform-capabilities EXTERNAL, e.g.,

aws batch register-job-definition --job-definition-name sleep30 --type container --container-properties '{ "image": "busybox", "vcpus": 1, "memory": 128, "command": [ "sleep", "30"]}' --platform-capabilities EXTERNAL

rrizun commented 2 years ago

   --platform-capabilities (list)
      The platform capabilities required by  the  job  definition.  If  no
      value  is  specified, it defaults to EC2 . To run the job on Fargate
      resources, specify FARGATE .

      (string)

   Syntax:

      "string" "string" ...

      Where valid values are:
        EC2
        FARGATE

rrizun commented 2 years ago

tried to sneak it past using --cli-input-json .. no luck =(

rrizun@rrizuns-MacBook-Air farspot % cat newjob.json 
{
    "jobDefinitionName": "sleep30",
    "type": "container",
    "containerProperties": {
        "image": "busybox",
        "vcpus": 1,
        "memory": 1024,
        "command": [
            "sleep",
            "30"
        ]
    },
    "platformCapabilities": [
        "EXTERNAL"
    ]
}
rrizun@rrizuns-MacBook-Air farspot % aws batch register-job-definition --cli-input-json file://newjob.json

An error occurred (ClientException) when calling the RegisterJobDefinition operation: Error executing request, Exception : Capability EXTERNAL is not valid. Valid capabilities: [FARGATE, EC2], RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
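
A small local check can surface the same rejection before calling the API. The following is a sketch only: the accepted set is copied from the error message above and may change if Batch adds capabilities, and `rejected_capabilities` is a hypothetical helper name, not an AWS API.

```python
import json

# Values Batch accepted according to the error message above. This set is an
# assumption taken from that message and may be out of date.
ACCEPTED_CAPABILITIES = {"EC2", "FARGATE"}

# Same content as newjob.json above.
NEWJOB_JSON = """
{
    "jobDefinitionName": "sleep30",
    "type": "container",
    "containerProperties": {
        "image": "busybox",
        "vcpus": 1,
        "memory": 1024,
        "command": ["sleep", "30"]
    },
    "platformCapabilities": ["EXTERNAL"]
}
"""


def rejected_capabilities(job_definition):
    """Return the platformCapabilities entries outside the accepted set, sorted."""
    requested = job_definition.get("platformCapabilities", [])
    return sorted(set(requested) - ACCEPTED_CAPABILITIES)


job = json.loads(NEWJOB_JSON)
print(rejected_capabilities(job))  # ['EXTERNAL']
```

An empty list means the definition only uses capabilities from the accepted set; it does not guarantee that the rest of the definition is valid.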

phitoduck commented 2 years ago

I'm fairly deep into a project where I assumed this would be possible. I think the existence of this thread means we're officially at a dead end. We were hoping to run the Metaflow ML model training framework on our own GPU-machines.

Is there anything that can be done to help prioritize this?

jamie1911 commented 4 weeks ago

With GPUs in AWS being in such high demand, having the option to use our on-prem GPU clusters in AWS Batch would be incredibly helpful. If Batch job definitions supported the EXTERNAL option, we could easily switch some of our jobs to on-prem GPUs with minimal adjustments. We have tried several workarounds, and none have been successful so far. Notably, EXTERNAL works seamlessly with ECS task definitions but still not with Batch, even after three years.