gabbychen opened 1 week ago
Memory buffers not being registered perhaps? https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html
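For reference, registration looks roughly like this. A minimal sketch, assuming NCCL >= 2.19 (where `ncclCommRegister`/`ncclMemAlloc` were introduced), an already-initialized communicator and stream, and an illustrative `NCCLCHECK` error macro:

```c
/* Sketch: user buffer registration for AllGather (NCCL >= 2.19). */
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define NCCLCHECK(cmd) do {                                        \
    ncclResult_t r = (cmd);                                        \
    if (r != ncclSuccess) {                                        \
        fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r));\
        exit(1);                                                   \
    }                                                              \
} while (0)

void allgather_registered(ncclComm_t comm, cudaStream_t stream,
                          size_t sendcount /* floats per rank */) {
    void *sendbuf, *recvbuf, *sendhandle, *recvhandle;
    int nranks;
    NCCLCHECK(ncclCommCount(comm, &nranks));

    /* ncclMemAlloc returns memory meeting NCCL's alignment/granularity
     * requirements, which registration needs. */
    NCCLCHECK(ncclMemAlloc(&sendbuf, sendcount * sizeof(float)));
    NCCLCHECK(ncclMemAlloc(&recvbuf, nranks * sendcount * sizeof(float)));

    /* Register both buffers with the communicator so NCCL can access them
     * directly instead of staging data through its internal buffers. */
    NCCLCHECK(ncclCommRegister(comm, sendbuf,
                               sendcount * sizeof(float), &sendhandle));
    NCCLCHECK(ncclCommRegister(comm, recvbuf,
                               nranks * sendcount * sizeof(float), &recvhandle));

    NCCLCHECK(ncclAllGather(sendbuf, recvbuf, sendcount, ncclFloat,
                            comm, stream));
    cudaStreamSynchronize(stream);

    NCCLCHECK(ncclCommDeregister(comm, sendhandle));
    NCCLCHECK(ncclCommDeregister(comm, recvhandle));
    NCCLCHECK(ncclMemFree(sendbuf));
    NCCLCHECK(ncclMemFree(recvbuf));
}
```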
Hi, I used nccl-tests for profiling, and I found that the amount of data written to memory is roughly double the received data size I configured. For example, with a total data size of 1 GB for AllGather profiling on 4 GPUs, the write data size is 1.5 GB in-place (2 × 0.75 GB of received data) and 1.75 GB out-of-place. With the same total data size on 2 GPUs, the write data size is 1 GB in-place (2 × 0.5 GB of received data) and 1.5 GB out-of-place. I suspect this is caused by some internal mechanism.
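To make the numbers concrete, here is the arithmetic I assume is behind them (the double-write assumption is my own guess: each received byte is written once into an internal NCCL buffer and once into the destination, and out-of-place additionally copies the rank's own chunk):

```c
/* Back-of-the-envelope check of the observed AllGather write traffic. */
#include <stdio.h>

int main(void) {
    double total_gb = 1.0;          /* total data size passed to nccl-tests */
    int ngpus[] = {4, 2};
    for (int i = 0; i < 2; i++) {
        int n = ngpus[i];
        double chunk    = total_gb / n;      /* each rank's contribution   */
        double received = total_gb - chunk;  /* data received from peers   */
        double in_place  = 2.0 * received;   /* staged copy + final write  */
        double out_place = in_place + chunk; /* plus local chunk copy      */
        printf("%d GPUs: in-place %.2f GB, out-of-place %.2f GB\n",
               n, in_place, out_place);
    }
    return 0;
}
/* Prints: 4 GPUs: in-place 1.50 GB, out-of-place 1.75 GB
 *         2 GPUs: in-place 1.00 GB, out-of-place 1.50 GB */
```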
I wonder whether this is caused by an extra buffer copy during communication or by something else. (1) Is it possible to remove the extra buffer copy and write directly to the destination to get better communication performance? (2) Would removing the extra buffer copy cause any other problems?
Please try the `-R 1` option when using nccl-tests: https://github.com/NVIDIA/nccl-tests/blob/8dfeab9eb9bdfdf13503e71e1f33e7f8a208b540/src/common.cu#L876
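For example (an illustrative invocation; the binary path depends on how you built nccl-tests):

```
./build/all_gather_perf -b 1G -e 1G -g 4 -R 1
```

With `-R 1`, the test registers its send/receive buffers with the communicator, so NCCL should be able to skip the staging copy through its internal buffers where the transport allows it.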
Thanks, I will try it.