If you installed aws-ofi-nccl from conda and your system has a libfabric version older than 1.18.2 together with aws-ofi-nccl 1.9.0, you may see errors such as the following:
[0] NCCL INFO NET/Plugin : dlerror=/opt/amazon/efa/lib/libfabric.so.1: version `FABRIC_1.7' not found (required by /fsx/ubuntu/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.10/site-packages/torch/lib/../../../../libnccl-net.so) No plugin found (libnccl-net.so), using internal implementation
You can fix this by upgrading to aws-ofi-nccl 1.9.1 or downgrading to aws-ofi-nccl 1.7.4.
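A minimal sketch of checking for the affected combination and printing a suggested fix. It assumes that `fi_info` (the libfabric utility) is on `PATH` and reports a `libfabric: <version>` line, that GNU `sort -V` is available, and that aws-ofi-nccl was installed from the conda-forge channel. The `version_lt` helper and the `LIBFABRIC_VERSION` override are hypothetical names used only for illustration. The script prints the conda command rather than running it, so adjust the channel and environment to your setup first.

```shell
# Succeeds when version $1 is strictly older than version $2 (relies on GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Assumption: fi_info is on PATH and prints a line such as "libfabric: 1.18.1".
# Set LIBFABRIC_VERSION by hand to skip the lookup (hypothetical override).
libfabric_version="${LIBFABRIC_VERSION:-$(fi_info --version 2>/dev/null | awk '/^libfabric:/ {print $2}')}"

if [ -n "$libfabric_version" ] && version_lt "$libfabric_version" "1.18.2"; then
    echo "libfabric $libfabric_version is older than 1.18.2; avoid aws-ofi-nccl 1.9.0"
    # Assumption: the package comes from conda-forge; use 1.7.4 instead to downgrade.
    echo "suggested: conda install -y -c conda-forge aws-ofi-nccl=1.9.1"
fi
```

On EFA installations `fi_info` may live under `/opt/amazon/efa/bin`, in which case add that directory to `PATH` or set `LIBFABRIC_VERSION` manually before running the check.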
Fixed in https://github.com/aws-samples/awsome-distributed-training/pull/291