Closed: mattmcclean closed this issue 1 year ago
@mattmcclean Can you also supply a use case so that we can ideate on what a potential integration might look like?
The use case would be to run a distributed training job across multiple machines using a library such as PyTorch or TensorFlow. This is common for large-scale Computer Vision or NLP tasks with models that cannot fit into the memory of a single GPU (referred to as model parallelism), or where you want to speed up training by splitting your training data over multiple nodes and syncing the model weights across GPUs (referred to as data parallelism). The maximum number of GPUs per EC2 instance is 8, and some users may want to run on a larger cluster than that using Horovod, DeepSpeed, or the native distributed training libraries in TensorFlow or PyTorch.
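To make the data-parallel idea concrete, here is a toy, framework-free sketch (plain Python, not actual PyTorch/Horovod code; all function names are illustrative) of what each training step does: every worker computes a gradient on its own shard of data, and the gradients are then averaged so all model replicas apply the same update and stay in sync.

```python
# Toy illustration of data parallelism: each "worker" holds a replica of the
# model weights, computes a gradient on its own data shard, and the gradients
# are averaged before every replica applies the identical update. In real
# systems (Horovod, torch.distributed) the averaging is an all-reduce that
# runs across GPUs and machines.

def local_gradient(weights, shard):
    # Stand-in for backprop: gradient of mean squared error for y = w * x.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]

def all_reduce_mean(gradients):
    # Average per-parameter gradients across all workers.
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

def data_parallel_step(weights, shards, lr=0.01):
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)
    # Every replica applies the same averaged update.
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, each with its own shard of (x, y) pairs drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = data_parallel_step(weights, shards)
print(round(weights[0], 2))  # converges toward 3.0
```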
That's interesting. We have an effort in-flight to integrate with AWS Batch array jobs #415 (since AWS Step Functions refuses to execute more than ~40 Batch jobs at a time). Let me think a bit about a possible programming model to support model parallelism.
It would be quite simple, as it only requires a slightly different job definition. Below is an example of a multi-node job definition: you set `type` to `multinode` (instead of `container`) and then set the `nodeProperties` parameter with a list of container properties per instance.
```json
{
  "jobDefinitionName": "gromacs-jobdef",
  "jobDefinitionArn": "arn:aws:batch:us-east-2:123456789012:job-definition/gromacs-jobdef:1",
  "revision": 6,
  "status": "ACTIVE",
  "type": "multinode",
  "parameters": {},
  "nodeProperties": {
    "numNodes": 2,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:1",
        "container": {
          "image": "123456789012.dkr.ecr.us-east-2.amazonaws.com/gromacs_mpi:latest",
          "vcpus": 8,
          "memory": 24000,
          "command": [],
          "jobRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
          "ulimits": [],
          "instanceType": "p3.2xlarge"
        }
      }
    ]
  }
}
```
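As a rough sketch of how such a definition could be assembled programmatically, here is a small helper (the function name and defaults are hypothetical, not an existing library API; the request fields mirror the AWS Batch `RegisterJobDefinition` API shown in the JSON above):

```python
# Hypothetical helper that assembles a multi-node AWS Batch job definition
# request. Only the field names come from the RegisterJobDefinition API;
# the helper itself is illustrative.

def build_multinode_job_def(name, image, num_nodes, vcpus, memory, instance_type):
    return {
        "jobDefinitionName": name,
        "type": "multinode",  # instead of "container" for single-node jobs
        "nodeProperties": {
            "numNodes": num_nodes,
            "mainNode": 0,  # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    # One container spec applied to every node in the range.
                    "targetNodes": f"0:{num_nodes - 1}",
                    "container": {
                        "image": image,
                        "vcpus": vcpus,
                        "memory": memory,
                        "instanceType": instance_type,
                        "command": [],
                    },
                }
            ],
        },
    }

job_def = build_multinode_job_def(
    "train-jobdef",
    "123456789012.dkr.ecr.us-east-2.amazonaws.com/train:latest",
    num_nodes=4, vcpus=8, memory=24000, instance_type="p3.2xlarge",
)
# The dict could then be passed to boto3, e.g.
# boto3.client("batch").register_job_definition(**job_def)
```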
That's true, but we need to think through how a user step gets mapped to a multi-node Batch job.
Here is how we do parallel training in PyTorch: https://github.com/zillow/metaflow/blob/feature/kfp/metaflow/tutorials/10-pytorch/hello_pytorch.py#L37-L61.
A Metaflow-native gang scheduling approach would be better.
`@parallel` affords support for this pattern on AWS Batch. Please reach out on chat.metaflow.org if you would like to get started.
AWS Batch provides the ability to run multi-node parallel jobs, enabling you to run single jobs that span multiple Amazon EC2 instances. With AWS Batch multi-node parallel jobs, you can run distributed GPU model training without the need to launch, configure, and manage Amazon EC2 resources directly. See docs.
It requires a slightly different config for the job definition file, with the `type` parameter set to `multinode` instead of `container`. Perhaps the batch decorator could take a flag for distributed training that creates a different job definition to support it. It would also be great to have a job definition that allows for EFA support, similar to this blog post.