aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

NCCL WARN NET/OFI Only EFA provider is supported #174

Closed mkserge closed 1 year ago

mkserge commented 1 year ago

Hi,

Apologies in advance if this is not the right place to ask.

I am trying to run PyTorch DDP with the NCCL backend on SageMaker. I have my own Docker image, which uses the following as a base: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py38-cu117-ubuntu20.04-sagemaker

This is configured as a SageMaker Training job through the AWS Console (I am working on someone else's setup, so no full control/understanding of the underlying details).

The instance is ml.p3.16xlarge with 8 V100 GPUs. I don't need node-to-node communication, just communication between the GPUs on a single node. I have no issues running my code on an EC2 instance (ml.p3.16xlarge, but not using the Docker image directly).

When I run my job, I see the following warnings in the logs, and then the job seems to "hang".

algo-1:249:249 [0] ofi_init:1288 NCCL WARN NET/OFI Only EFA provider is supported
algo-1:249:249 [0] ofi_init:1339 NCCL WARN NET/OFI aws-ofi-nccl initialization failed

Could someone help me figure out the source of the issue here? I am not too familiar with EFA, but digging through the documentation I see it's only supported on p3dn.24xlarge and p4d.24xlarge instances, which is not what I need. Is this a configuration issue with the container? Why is only the EFA provider supported through NCCL?

Any pointers would be really appreciated.

rashikakheria commented 1 year ago

@mkserge As you noted, the aws-ofi-nccl plugin is supported only on p3dn.24xlarge and p4d.24xlarge instances. EFA doesn't support p3.16xlarge either (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).

Since you only want to do intra-node communication, I can advise the following:

  1. Remove the EFA installation from your Docker image, as this isn't a supported platform type.
  2. Use NCCL's socket interface directly rather than using the plugin. (You can change LD_LIBRARY_PATH so that NCCL is not able to dynamically load the plugin.)
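
For the second option, here is a minimal Python sketch of one way to build such an LD_LIBRARY_PATH. It is an illustration, not something taken from the plugin's documentation. It assumes the plugin is installed under its default file name, libnccl-net.so, and that it is reachable only through LD_LIBRARY_PATH (not through the system linker cache). The helper names `strip_plugin_dirs` and `child_env` are made up for this example. Setting NCCL_DEBUG=INFO makes NCCL log which network it ends up using.

```python
import os

# Assumption: the plugin is installed under the default file name that NCCL
# looks for when loading an external network plugin.
PLUGIN_FILENAME = "libnccl-net.so"


def strip_plugin_dirs(ld_library_path, plugin_filename=PLUGIN_FILENAME):
    """Return ld_library_path without the directories that contain the plugin."""
    kept = []
    for entry in ld_library_path.split(os.pathsep):
        # Drop only the entries that actually hold the plugin library.
        if entry and os.path.isfile(os.path.join(entry, plugin_filename)):
            continue
        kept.append(entry)
    return os.pathsep.join(kept)


def child_env(base_env=None):
    """Build an environment for the training process (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_LIBRARY_PATH"] = strip_plugin_dirs(env.get("LD_LIBRARY_PATH", ""))
    # Ask NCCL to log its initialization, including the network it selects.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the cleaned value has to be passed to a new process, for example with `subprocess.run([...], env=child_env())`, rather than set inside a training script that is already running.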

I hope that helps.

mkserge commented 1 year ago

Hi Rashika,

Thank you for your response and apologies for the delay.

I went ahead and removed all the EFA-related parts from the Dockerfile in the aws/deep-learning-containers repository and rebuilt the image from scratch, and it seems to be working now. Thank you for your advice!

It seems a bit odd that the AWS DLCs ship NCCL with EFA support only. It looks like that essentially prevents one from doing distributed training with the NCCL backend on anything other than those two instance types, unless one is willing to build the base images oneself. I opened a related issue in the aws/deep-learning-containers repository to get some feedback on this.