mkserge closed this issue 1 year ago
@mkserge As you noted, the aws-ofi-nccl plugin is supported only on p3dn.24xlarge and p4d.24xlarge instances. EFA doesn't support p3.16xlarge either (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
Since you only want to do intra-node communication, I can advise the following:
I hope that helps.
Hi Rashika,
Thank you for your response and apologies for the delay.
I went ahead and removed all the EFA-related parts from the Dockerfile in the aws/deep-learning-containers repository, rebuilt the image from scratch, and it seems to be working now. Thank you for your advice!
It seems a bit odd that the AWS DLCs ship NCCL with EFA support only. That essentially prevents one from doing distributed training with the NCCL backend on anything other than those two instances, unless one is willing to build the base images oneself. I opened a related issue in the aws/deep-learning-containers repository to get some feedback on this.
Hi,
Apologies in advance if this is not the right place to ask.
I am trying to run PyTorch DDP with the NCCL backend on SageMaker. I have my own Docker image, which uses the following as a base:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py38-cu117-ubuntu20.04-sagemaker
This is configured as a SageMaker Training job through the AWS Console (I am working on someone else's setup, so no full control/understanding of the underlying details).
The instance is ml.p3.16xlarge with 8 V100 GPUs. I don't need node-to-node communication, just communication between the GPUs on a single node. I have no issues running my code on an EC2 instance (ml.p3.16xlarge, but not using the Docker image directly).
Running my job, I see the following warnings in the logs, and then the job seems to "hang".
Could someone help me figure out the source of the issue here? I am not too familiar with EFA, but digging through the documentation here I see it's only supported on p3dn.24xlarge and p4d.24xlarge instances, which is not what I need. Is this a configuration issue with the container? Why is only the EFA provider supported through NCCL?
Any pointers would be really appreciated.
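For anyone trying to narrow down a hang like this, a minimal sketch of one way to get more detail out of NCCL on a single node is shown below. It only builds an environment and a launch command; it does not reproduce or fix the issue. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables and `torchrun --standalone --nproc_per_node` is the standard PyTorch launcher invocation, but the helper names and the `train.py` script name are hypothetical and not taken from this thread.

```python
import os
import shlex


def single_node_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # INFO-level logs report which network transport or plugin NCCL selects.
    env["NCCL_DEBUG"] = "INFO"
    # Limit the output to initialization and network messages.
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    return env


def torchrun_command(script, gpus_per_node=8):
    """Build a single-node launch command with one process per GPU."""
    return [
        "torchrun",
        "--standalone",  # single-node rendezvous, no external master needed
        f"--nproc_per_node={gpus_per_node}",
        script,
    ]


if __name__ == "__main__":
    env = single_node_debug_env()
    cmd = torchrun_command("train.py")  # hypothetical script name
    # Print the command; pass `cmd` and `env` to subprocess.run to launch it.
    print("NCCL_DEBUG=" + env["NCCL_DEBUG"], shlex.join(cmd))
```

With this logging enabled, the initialization lines in the job output should indicate whether NCCL is attempting to load a network plugin before the point where the job stalls.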