Open ihchoi12 opened 2 years ago
The code in common.cu is not NCCL code; it's the nccl-tests code that verifies the data is correct. The decision to use 2x288 is made here: https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L455-L484
Oh, I see! Thanks for the clarification. Let me investigate that part.
Hello team!
I'm analyzing the result of an all_reduce_perf run on two AWS p3.2xlarge nodes (one GPU per node). Here is the command I ran:

mpirun --mca btl_tcp_if_include ens3 -npernode 1 -np 2 --hostfile ./hostfile -verbose nsys profile -t cuda,osrt,nvtx,cudnn,cublas -f true -o ./baseline -w true /home/ih/nccl-tests/build/all_reduce_perf -b 128M -e 128M -g 1 -w 0 -n 1 -c 0
Currently, I'm analyzing the size of the grid and block launched by the NCCL allreduce kernel, and I have one question.
In the Nsight Systems trace, the ncclKernel_AllReduce operation is launched with 2 blocks (grid <<<2, 1, 1>>>) and 288 threads per block (block <<<288, 1, 1>>>).
To see how the kernels are launched in the code, I looked at common.cu: https://github.com/NVIDIA/nccl-tests/blob/8274cb47b6dc70ce4411e7f114b77173d3892414/src/common.cu#L342
Interestingly, that code launches 32 blocks (grid <<<32, 1, 1>>>) and 256 threads per block (block <<<256, 1, 1>>>). May I know why the grid and block sizes in the trace differ from those in the code? Am I misunderstanding something here?
Thanks!