aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution

NCCL libfabric conflict caused by aws-ofi-nccl 1.9.0 #292

Open sean-smith opened 4 months ago

sean-smith commented 4 months ago

If you've installed aws-ofi-nccl 1.9.0 from conda on a system whose libfabric version is older than 1.18.2, you may see errors such as the following:

 [0] NCCL INFO NET/Plugin : dlerror=/opt/amazon/efa/lib/libfabric.so.1: version `FABRIC_1.7' not found (required by /fsx/ubuntu/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.10/site-packages/torch/lib/../../../../libnccl-net.so) No plugin found (libnccl-net.so), using internal implementation

You can fix this by upgrading to aws-ofi-nccl 1.9.1 or downgrading to aws-ofi-nccl 1.7.4 like so:

conda install aws-ofi-nccl=1.7.4 \
--override-channels \
-c https://aws-ml-conda.s3.us-west-2.amazonaws.com/ \
-c nvidia -c conda-forge
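
To check whether a node is in the affected range before changing packages, something like the sketch below may help. It assumes the EFA installer placed `fi_info` under /opt/amazon/efa/bin (inferred from the library path in the log above), that `fi_info --version` prints a line beginning with `libfabric:`, and that GNU `sort -V` is available. The `version_lt` helper and the `FI_INFO` variable are hypothetical names used only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether the installed libfabric is older than 1.18.2
# and show which aws-ofi-nccl build conda has installed.

# version_lt A B: succeeds when version A is strictly lower than version B.
# Relies on GNU sort -V for version-aware ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Assumed location of fi_info; override with FI_INFO=/path/to/fi_info.
FI_INFO="${FI_INFO:-/opt/amazon/efa/bin/fi_info}"

if [ -x "$FI_INFO" ]; then
    # Keep only the leading numeric part, e.g. "1.18.1" from "1.18.1amzn1.0".
    libfabric_version="$("$FI_INFO" --version |
        awk '/^libfabric:/ {print $2}' |
        grep -oE '^[0-9]+(\.[0-9]+)*')"
    if [ -z "$libfabric_version" ]; then
        echo "Could not parse the libfabric version from $FI_INFO"
    elif version_lt "$libfabric_version" "1.18.2"; then
        echo "libfabric $libfabric_version is older than 1.18.2"
    else
        echo "libfabric $libfabric_version is 1.18.2 or newer"
    fi
else
    echo "fi_info not found at $FI_INFO"
fi

# List the aws-ofi-nccl package in the active conda environment, if any.
if command -v conda >/dev/null 2>&1; then
    conda list aws-ofi-nccl
fi
```

If the script reports a libfabric older than 1.18.2 and `conda list` shows aws-ofi-nccl 1.9.0, the node matches the combination described in this issue.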

Fixed in https://github.com/aws-samples/awsome-distributed-training/pull/291

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 30 days with no activity.