I'd like to ask about the performance of the int32 datatype compared with fp16 when using the allreduce API. I'm not sure whether this is normal, but the int32 latency is almost 6x larger than fp16. That seems weird considering the difference is only 16 bits per element. Could you please give me some insights?
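That should not be the case unless the int32 operation has 3x more elements: int32 is 4 bytes per element versus 2 bytes for fp16, so the same element count means only 2x the data. You may also want to check the buffer alignment; it is good to align buffers to 16 bytes.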
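To make the arithmetic behind that reply concrete, here is a minimal sketch. It does not call any allreduce library; the helper names (`payload_bytes`, `expected_ratio`, `is_aligned`) and the sample addresses are made up for illustration. It assumes latency scales roughly with payload size for large messages, which may not hold for small ones.

```python
# Element sizes in bytes for the two datatypes discussed above.
DTYPE_SIZE = {"fp16": 2, "int32": 4}


def payload_bytes(count, dtype):
    """Bytes moved for `count` elements of `dtype`."""
    return count * DTYPE_SIZE[dtype]


def expected_ratio(count_a, dtype_a, count_b, dtype_b):
    """Payload-size ratio of operation A to operation B.

    Only a rough guide to the latency ratio, and only under the
    assumption that both operations are bandwidth-bound.
    """
    return payload_bytes(count_a, dtype_a) / payload_bytes(count_b, dtype_b)


def is_aligned(address, alignment=16):
    """True if a buffer address is a multiple of `alignment` bytes."""
    return address % alignment == 0


if __name__ == "__main__":
    n = 1 << 20  # hypothetical element count
    # Same element count: int32 carries twice the bytes of fp16.
    print(expected_ratio(n, "int32", n, "fp16"))      # 2.0
    # A 6x payload ratio needs 3x as many int32 elements.
    print(expected_ratio(3 * n, "int32", n, "fp16"))  # 6.0
    # Hypothetical buffer addresses, checked against 16-byte alignment.
    print(is_aligned(0x7F0000000010))  # True
    print(is_aligned(0x7F0000000008))  # False
```

If the measured ratio is well above 2x with identical element counts, it is worth double-checking that the count passed to the call is in elements rather than bytes, and printing the actual buffer addresses to run them through an alignment check like the one above.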