aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

NCCL Cannot Find Tuner Symbols. Need to Export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so #472

Open zhanwenchen opened 2 months ago

zhanwenchen commented 2 months ago

Hello,

I followed the official AWS installation guide for the aws-ofi-nccl plugin, but I found a potential issue with the tuner. When I run the nccl-tests command from the linked guide:

/opt/amazon/openmpi/bin/mpirun \
-x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi5/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
--hostfile my-hosts -n 8 -N 8 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

I got:

ip-172-31-18-239:755152:755194 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ip-172-31-18-239:755152:755194 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.

Only with export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so do I get:

ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Plugin name set by env to /opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so
ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
ip-172-31-18-239:754820:754863 [5] NCCL INFO TUNER/Plugin: Using tuner plugin nccl_ofi_tuner

rauteric commented 2 months ago

Yes, the current public instructions do not load the tuner. Setting NCCL_TUNER_PLUGIN as you have done is the correct way to load the tuner.

Loading the tuner is not required to use the plugin, although the tuner improves performance in some configurations. We may update the public instructions in the future to include loading the tuner.
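
For anyone scripting this, here is a minimal sketch of one way to forward the variable to every rank through mpirun's -x option, the same option the command above already uses for NCCL_DEBUG. The helper name tuner_args is hypothetical and not part of the plugin or the guide. The library path is the one reported in this issue and may differ on other installations.

```shell
#!/bin/sh
# tuner_args is a hypothetical helper, not part of aws-ofi-nccl.
# It prints the mpirun arguments that forward NCCL_TUNER_PLUGIN to every
# rank, but only when the tuner library exists at the given path.
# If the file is missing it prints nothing, so NCCL keeps its internal tuner.
tuner_args() {
    if [ -f "$1" ]; then
        printf '%s\n' "-x NCCL_TUNER_PLUGIN=$1"
    fi
}

# Path as reported in this issue; adjust for your installation.
tuner_args /opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so
```

The output can be spliced, unquoted, into the existing mpirun command line next to -x NCCL_DEBUG=INFO, for example as $(tuner_args /opt/aws-ofi-nccl/lib/libnccl-ofi-tuner.so). This relies on word splitting, so it assumes the library path contains no spaces. With NCCL_DEBUG=INFO set, the "Using tuner plugin nccl_ofi_tuner" line shown earlier should confirm that the tuner was loaded.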