Closed: mattmcclean closed this issue 1 year ago
@mattmcclean Can you also supply a use case so that we can ideate on what a potential integration might look like?
The use case would be to run a distributed training job across multiple machines using a library such as PyTorch or TensorFlow. This is common for large-scale Computer Vision or NLP tasks with models that cannot fit into the memory of a single GPU (referred to as model parallelism), or where you want to speed up training by splitting your training data over multiple nodes and syncing the model weights across GPUs (referred to as data parallelism). The maximum number of GPUs per EC2 instance is 8, and some users may want to run on a larger cluster than that using Horovod, DeepSpeed, or the native distributed training libraries in TensorFlow or PyTorch.
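To make the data-parallel idea concrete, here is a toy, framework-free sketch (plain Python, not actual PyTorch/Horovod code; all function names are illustrative) of what each training step does: every worker computes a gradient on its own shard of data, and the gradients are then averaged so all model replicas apply the same update and stay in sync.

```python
# Toy illustration of data parallelism: each "worker" holds a replica of the
# model weights, computes a gradient on its own data shard, and the gradients
# are averaged before every replica applies the identical update. In real
# systems (Horovod, torch.distributed) the averaging is an all-reduce that
# runs across GPUs and machines.

def local_gradient(weights, shard):
    # Stand-in for backprop: gradient of mean squared error for y = w * x.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def all_reduce_mean(gradients):
    # Average per-parameter gradients across all workers.
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

def data_parallel_step(weights, shards, lr=0.01):
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)
    # Every replica applies the same averaged update.
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, each with its own shard of (x, y) pairs drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = data_parallel_step(weights, shards)
print(round(weights[0], 2))  # converges toward 3.0
```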
That's interesting. We have an effort in-flight to integrate with AWS Batch array jobs #415 (since AWS Step Functions refuses to execute more than ~40 Batch jobs at a time). Let me think a bit about a possible programming model to support model parallelism.
It would be quite simple, as it only requires a slightly different job definition. Below is an example of a multi-node job definition: you set `type` to `multinode` (instead of `container`) and then set the `nodeProperties` parameter with a list of container properties per instance.
```json
{
  "jobDefinitionName": "gromacs-jobdef",
  "jobDefinitionArn": "arn:aws:batch:us-east-2:123456789012:job-definition/gromacs-jobdef:1",
  "revision": 6,
  "status": "ACTIVE",
  "type": "multinode",
  "parameters": {},
  "nodeProperties": {
    "numNodes": 2,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:1",
        "container": {
          "image": "123456789012.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
          "vcpus": 8,
          "memory": 24000,
          "command": [],
          "jobRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
          "ulimits": [],
          "instanceType": "p3.2xlarge"
        }
      }
    ]
  }
}
```
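As a rough sketch of how such a definition could be assembled programmatically, here is a small helper (the function name and defaults are hypothetical, not an existing library API; the request fields mirror the AWS Batch `RegisterJobDefinition` API shown in the JSON above):

```python
# Hypothetical helper that assembles a multi-node AWS Batch job definition
# request. Only the field names come from the RegisterJobDefinition API;
# the helper itself is illustrative.

def build_multinode_job_def(name, image, num_nodes, vcpus, memory, instance_type):
    return {
        "jobDefinitionName": name,
        "type": "multinode",  # instead of "container" for single-node jobs
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,  # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    # One container spec applied to every node in the range.
                    "targetNodes": f"0:{num_nodes - 1}",
                    "container": {
                        "image": image,
                        "vcpus": vcpus,
                        "memory": memory,
                        "instanceType": instance_type,
                        "command": [],
                    },
                }
            ],
        },
    }

job_def = build_multinode_job_def(
    "train-jobdef",
    "123456789012.dkr.ecr.us-east-2.amazonaws.com/train:latest",
    num_nodes=4, vcpus=8, memory=24000, instance_type="p3.2xlarge",
)
# The dict could then be passed to boto3, e.g.
# boto3.client("batch").register_job_definition(**job_def)
```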
That's true, but we need to think through how a user step gets mapped to a multi-node Batch job.
Here is how we do parallel training in PyTorch: https://github.com/zillow/metaflow/blob/feature/kfp/metaflow/tutorials/10-pytorch/hello_pytorch.py#L37-L61.
A Metaflow-native gang scheduling approach would be better.
`@parallel` affords support for this pattern on AWS Batch. Please reach out on chat.metaflow.org if you would like to get started.
AWS Batch provides the ability to run multi-node parallel jobs, enabling you to run single jobs that span multiple Amazon EC2 instances. With AWS Batch multi-node parallel jobs, you can run distributed GPU model training without the need to launch, configure, and manage Amazon EC2 resources directly. See docs.
It requires a slightly different config for the job definition file, with the `type` parameter set to `multinode` instead of `container`. Perhaps the batch decorator could take a flag for distributed training that creates a different job definition to support it. It would also be great to have a job definition that allows for EFA support, similar to this blog post.