NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nsys profile hangs when NCCL_P2P_USE_CUDA_MEMCPY is enabled #1480

Open PhdShi opened 1 month ago

PhdShi commented 1 month ago

I am using the Nsight Systems tool to observe the behavior of allreduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command hangs after allreduce_perf finishes and never generates the report file. Here is my run script:

#!/bin/bash
/usr/local/mpi/bin/mpirun --allow-run-as-root \
    --mca btl_openib_warn_no_device_params_found 0 --mca btl_tcp_if_include bond0 \
    --hostfile iplist --map-by ppr:8:node -np 8 \
    -x NCCL_IB_TC=136 -x NCCL_IB_SL=5 -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_SOCKET_IFNAME=bond -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5 \
    -x NCCL_IB_TIMEOUT=22 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_NET_PLUGIN=none \
    -x NCCL_ALGO=Ring -x NCCL_P2P_USE_CUDA_MEMCPY=1 \
    -x LD_PRELOAD=/workspace/nccl2.21.5/build/lib/libnccl.so.2 \
    /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10

This is the command I execute: nsys profile -o allreduce_ce_default.nsys-rep bash runtest.sh

NGC image version: nvcr.io/nvidia/pytorch:24.06-py3
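
In case it helps reproduce or work around the hang: a common alternative to wrapping mpirun itself is to attach one nsys session to each rank, so each profiler exits with its own process rather than waiting on the whole MPI job. A minimal sketch, assuming Open MPI (the %q{OMPI_COMM_WORLD_RANK} output-name substitution is standard nsys syntax; the NCCL/IB flags from the script above are abbreviated here):

# one nsys report per rank instead of one nsys wrapping mpirun
/usr/local/mpi/bin/mpirun --allow-run-as-root --hostfile iplist --map-by ppr:8:node -np 8 \
    -x NCCL_DEBUG=INFO -x NCCL_ALGO=Ring -x NCCL_P2P_USE_CUDA_MEMCPY=1 \
    nsys profile -o allreduce_ce_rank%q{OMPI_COMM_WORLD_RANK} \
    /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10

This produces one .nsys-rep file per rank rather than a single report for the whole job.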

sjeaugey commented 1 month ago

Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

PhdShi commented 1 month ago

> Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?

I noticed that issue #922 mentioned that turning on NCCL_P2P_USE_CUDA_MEMCPY can bring some performance improvements, so I wanted to test it. However, my test data shows that NCCL_P2P_USE_CUDA_MEMCPY hurts allreduce performance for large messages.
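
For reference, the comparison was a plain A/B toggle of that one variable, along these lines (a sketch; MPI_FLAGS here is just shorthand for the mpirun/NCCL options in the script above):

MPI_FLAGS="--allow-run-as-root --hostfile iplist --map-by ppr:8:node -np 8 -x NCCL_DEBUG=INFO -x NCCL_ALGO=Ring"
# baseline: NCCL's default (SM-based) intra-node P2P path
/usr/local/mpi/bin/mpirun $MPI_FLAGS /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10
# copy-engine path, enabled via the variable under discussion
/usr/local/mpi/bin/mpirun $MPI_FLAGS -x NCCL_P2P_USE_CUDA_MEMCPY=1 /usr/bin/all_reduce_perf -b 4k -e 8G -g 1 -f 2 -n 50 -w 10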

sjeaugey commented 1 month ago

Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

PhdShi commented 1 month ago

> Which is expected. It's not doing what you think it does. As with most other environment variables (aside from node configuration), you should not set it unless you really need it.

Can you explain why the performance decline is expected? I ran the cuda-samples p2pBandwidthLatencyTest and found that the copy engine performs much better than sm_copy. Doesn't that mean allreduce should perform better when NCCL_P2P_USE_CUDA_MEMCPY is enabled?
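
For context, the p2pBandwidthLatencyTest comparison was run roughly as follows (a sketch; the directory layout and Makefile-based build match recent cuda-samples releases and may differ in other checkouts):

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
# default mode: copy-engine (CE) driven peer-to-peer copies
./p2pBandwidthLatencyTest
# same measurement with SM-initiated copies, for comparison
./p2pBandwidthLatencyTest --sm_copy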

sjeaugey commented 1 month ago

NCCL_P2P_USE_CUDA_MEMCPY is not doing what you think. No, it won't improve performance on your system, unless you have a system where SM-based copy is a disaster (like 10x slower than CE). Then it could help -- sometimes. Again, don't use undocumented environment variables.