PhdShi opened 1 month ago
Why are you setting NCCL_P2P_USE_CUDA_MEMCPY?
I noticed that issue #922 mentioned that turning on NCCL_P2P_USE_CUDA_MEMCPY could bring some performance improvements, and I wanted to test it. However, my test data shows that NCCL_P2P_USE_CUDA_MEMCPY causes poor performance for allreduce on large messages.
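For reference, an A/B comparison along these lines can be scripted with nccl-tests; this is a minimal sketch, and the ./build path, message-size range, and 8-GPU count are assumptions rather than details from the original report:

# Baseline: default NCCL P2P path
./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8 > allreduce_default.log
# Same run with the cudaMemcpy-based P2P path (undocumented variable being tested)
NCCL_P2P_USE_CUDA_MEMCPY=1 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8 > allreduce_memcpy.log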
Which is expected. It's not doing what you think it does. As with most other environment variables (aside from those for node configuration), you should not set it unless you really need it.
Can you explain why the performance decline is expected? Using the cuda-samples p2pBandwidthLatencyTest, I found that the copy engine performs much better than an SM-based copy. Doesn't that mean allreduce should perform better when NCCL_P2P_USE_CUDA_MEMCPY is enabled?
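(The cuda-samples measurement referred to above can be reproduced roughly as follows; the directory path is an assumption based on the current NVIDIA/cuda-samples layout and may differ in other versions.)

# Build and run the P2P bandwidth/latency sample
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest   # prints per-GPU-pair bandwidth and latency with P2P enabled and disabled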
NCCL_P2P_USE_CUDA_MEMCPY is not doing what you think. No, it won't improve performance on your system, unless you have a system where SM-based copy is a disaster (like 10x slower than CE). Then it could help -- sometimes. Again, don't use undocumented environment variables.
I am using the Nsight Systems tool to observe the behavior of allreduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command hangs after allreduce_perf finishes and does not generate the corresponding report file. This is the command I run:
nsys profile -o allreduce_ce_default.nsys-rep bash runtest.sh
NGC image version:
nvcr.io/nvidia/pytorch:24.06-py3
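The contents of runtest.sh are not included in the report; a hypothetical sketch matching the description above (NCCL_P2P_USE_CUDA_MEMCPY enabled, allreduce_perf on 8 GPUs, nccl-tests assumed to be built at ./build) would be:

# runtest.sh (hypothetical; the actual script is not shown)
export NCCL_P2P_USE_CUDA_MEMCPY=1
./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8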