NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl topo about PHB and NODE #1502

Open jianzi123 opened 1 week ago

jianzi123 commented 1 week ago

[Image: topology graph]

In this graph, NET/1 and GPU(0) are in the same NUMA node, and NET/0 and GPU(1) are in the same NUMA node, all under the same RC. Why does NCCL report the connection as PHB instead of NODE?

kiskra-nvidia commented 1 week ago

Could you include more of NCCL's debug output, and maybe also the topo file (generated using NCCL_TOPO_DUMP_FILE)? What does nvidia-smi topo -m say?
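
One way to gather the information requested above is sketched below in Python. It is only an illustration: the helper names (`nccl_debug_env`, `run_with_nccl_debug`, `nvidia_smi_topo`) are hypothetical, and the choice of `NCCL_DEBUG=INFO` with `NCCL_DEBUG_SUBSYS=INIT,GRAPH` is an assumption about what "more debug output" would be useful here, not something specified in this thread.

```python
import os
import subprocess


def nccl_debug_env(topo_dump_path, base_env=None):
    """Return a copy of the environment with NCCL debug variables set.

    The verbosity settings are an assumption; adjust them to whatever the
    maintainers ask for.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"               # verbose NCCL logging
    env["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # init and topology/graph search logs
    env["NCCL_TOPO_DUMP_FILE"] = topo_dump_path  # NCCL writes its detected topology (XML) here
    return env


def run_with_nccl_debug(cmd, topo_dump_path):
    """Run an NCCL application (cmd is a list of arguments) with debug settings."""
    return subprocess.run(cmd, env=nccl_debug_env(topo_dump_path), check=False)


def nvidia_smi_topo():
    """Return the text of `nvidia-smi topo -m` (requires the NVIDIA driver tools)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `run_with_nccl_debug(["./my_nccl_app"], "/tmp/nccl_topo.xml")` would run a placeholder application (`./my_nccl_app` stands in for your own binary) and leave the topology dump at the given path; that file, the log output, and the result of `nvidia_smi_topo()` can then be attached to the issue.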