Open ihchoi12 opened 2 years ago
The code in common.cu is not NCCL code; it's the nccl-tests code that verifies the data is correct. The decision to use 2x288 is made here: https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L455-L484
Oh, I see! Thanks for the clarification. Let me investigate that part.
Hello team!
I'm analyzing the result of an all_reduce_perf run on two AWS p3.2xlarge nodes (one GPU per node). Here is the command I ran:

mpirun --mca btl_tcp_if_include ens3 -npernode 1 -np 2 --hostfile ./hostfile -verbose nsys profile -t cuda,osrt,nvtx,cudnn,cublas -f true -o ./baseline -w true /home/ih/nccl-tests/build/all_reduce_perf -b 128M -e 128M -g 1 -w 0 -n 1 -c 0
Currently, I'm analyzing the size of the grid and block launched by the NCCL allreduce kernel, and I have one question.
In the Nsight Systems trace, the ncclKernel_AllReduce operation is launched with 2 blocks (grid <<<2, 1, 1>>>) and 288 threads per block (block <<<288, 1, 1>>>).
To see how the kernels are launched in the code, I looked at common.cu: https://github.com/NVIDIA/nccl-tests/blob/8274cb47b6dc70ce4411e7f114b77173d3892414/src/common.cu#L342
Interestingly, that code launches 32 blocks (grid <<<32, 1, 1>>>) and 256 threads per block (block <<<256, 1, 1>>>). May I know why the grid and block sizes in the trace differ from those in the code? Am I misunderstanding something here?
Thanks!