NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

INT32 vs. FP16 performance on NCCL reduction #712

Open minghaoBD opened 2 years ago

minghaoBD commented 2 years ago

Hi there,

I want to ask about the performance of the int32 and fp16 datatypes when using the allreduce API. I am not sure whether this is normal, but the int32 latency is almost 6x larger than the fp16 latency. That seems weird considering the difference is only 16 bits per element. Could you please give me some insights?

Thanks : )


sjeaugey commented 2 years ago

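That should not be the case unless the int32 operation has 3x more elements: an int32 element is only 2x the size of an fp16 element, so a 6x gap would need 3x the element count. You may also want to check the buffer alignment; it is good to align buffers to 16 bytes.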
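
For anyone wanting to sanity-check the two points above, here is a minimal sketch of the arithmetic. It is plain Python, not NCCL code: `payload_bytes`, `is_aligned`, the element count, and the example addresses are all hypothetical names and values chosen for illustration. It also assumes the transfer is bandwidth-bound, so that time scales roughly with the number of bytes moved.

```python
# Bytes per element for the two datatypes discussed in this thread.
ELEMENT_SIZE = {"int32": 4, "fp16": 2}


def payload_bytes(count: int, dtype: str) -> int:
    """Total bytes moved for `count` elements of `dtype`."""
    return count * ELEMENT_SIZE[dtype]


def is_aligned(address: int, alignment: int = 16) -> bool:
    """True if a raw buffer address is a multiple of `alignment` bytes."""
    return address % alignment == 0


if __name__ == "__main__":
    n = 1 << 20  # hypothetical element count, not taken from the issue

    # Same element count: int32 moves 2x the bytes of fp16.
    print(payload_bytes(n, "int32") / payload_bytes(n, "fp16"))      # 2.0

    # 3x the element count in int32: 6x the bytes of fp16.
    print(payload_bytes(3 * n, "int32") / payload_bytes(n, "fp16"))  # 6.0

    # Hypothetical device addresses: the first is 16-byte aligned, the second is not.
    print(is_aligned(0x7F0000000010), is_aligned(0x7F0000000008))    # True False
```

So with equal element counts the expected gap is about 2x, and the first thing to compare is the `count` argument passed to each allreduce call, followed by the addresses of the send and receive buffers.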