Regarding the AllGather bandwidth with different byte alignment under different protocols

nachtsky1077 commented 4 years ago

I'm running into cases where I have to allgather a few extra bytes (say 4 bytes), which makes the data no longer 32-byte or 64-byte aligned. When this happens, I observe substantial performance degradation. I ran nccl-tests (with NCCL v2.5.7-1) to validate the observation.

Test command: ./all_gather_perf -b 16373696 -e 16373792 -i 4 -g 2

Case 1: NCCL_PROTO=Simple

Case 2: NCCL_PROTO=LL

Case 3: NCCL_PROTO=LL128

Based on these tests, Simple and LL128 protocols are sensitive to data alignment, while LL protocol is not.
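For reference, the sizes swept by the test command above can be enumerated with a short sketch. It assumes that -b, -e and -i are the minimum size, maximum size and step in bytes, and that 16 bytes (128 bits) is the alignment unit of interest; how nccl-tests maps each size to per-rank element counts is not modeled here.

```python
# Values taken from the test command: -b 16373696 -e 16373792 -i 4
MIN_BYTES, MAX_BYTES, STEP_BYTES = 16373696, 16373792, 4
VECTOR_BYTES = 16  # 128 bits; assumed alignment unit, not read from NCCL

# Every size visited by the sweep, inclusive of the upper bound.
sizes = list(range(MIN_BYTES, MAX_BYTES + 1, STEP_BYTES))
# Sizes that are whole multiples of the assumed alignment unit.
aligned = [s for s in sizes if s % VECTOR_BYTES == 0]

print(f"{len(sizes)} sizes swept, {len(aligned)} are multiples of {VECTOR_BYTES} bytes")
for s in sizes[:5]:
    remainder = s % VECTOR_BYTES
    print(s, "aligned" if remainder == 0 else f"{remainder} bytes past a multiple of {VECTOR_BYTES}")
```

Under these assumptions, only every fourth size in the sweep is a multiple of 16 bytes, so most of the sizes tested are unaligned.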

My question is:

Is this expected behavior? What is the underlying mechanism that makes the protocols respond differently to data alignment? In other words, how are data alignment and protocol related?
If it is expected, what is the recommended way to work around the performance degradation? Currently I'm padding the data to achieve better performance.

Thanks!

sjeaugey commented 4 years ago

Yes, it is expected. Both LL128 and Simple are optimized for bandwidth, which means using 128-bit loads and stores, and those don't work well on unaligned buffers. Simple should perform better in NCCL 2.8, but padding is a solution that will always work well.
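A minimal sketch of the padding workaround, assuming a 16-byte (128-bit) alignment target; the helper name padded_size is hypothetical, and a larger alignment such as 32 or 64 bytes can be passed in if that matches the buffers better. Since every rank contributes the same count to an allgather, the padding would apply to the per-rank size.

```python
def padded_size(nbytes: int, alignment: int = 16) -> int:
    """Round nbytes up to the next multiple of alignment (hypothetical helper)."""
    if nbytes < 0 or alignment <= 0:
        raise ValueError("nbytes must be >= 0 and alignment must be > 0")
    # Integer ceiling division, then scale back up to a byte count.
    return (nbytes + alignment - 1) // alignment * alignment


payload_bytes = 16373696 + 4          # aligned payload plus 4 extra bytes
send_bytes = padded_size(payload_bytes)
pad_bytes = send_bytes - payload_bytes  # bytes the receiver should ignore
print(payload_bytes, "->", send_bytes, f"({pad_bytes} bytes of padding)")
```

The receiver needs to know the true payload length so it can discard the padding, for example by carrying the length in a header field or agreeing on it out of band.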

nachtsky1077 commented 4 years ago

Thanks for the explanation! I'm also curious why unaligned buffers hurt performance so badly (around 13x according to the tests). Thanks!

sjeaugey commented 4 years ago

The whole code is optimized for 128-bit operations (which happen to require 128-bit alignment), so when we detect that buffers are not aligned, we need to fall back on much slower code. I agree the fallback code could probably be better optimized (and we did improve it in 2.8 for Simple), but it's not always easy.
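The detect-and-fall-back idea can be illustrated with a rough sketch. This is not NCCL's actual code (the real check lives in CUDA device code and may differ); the function names and the example base address are hypothetical. It relies only on the allgather layout, where the data from rank r lands at byte offset r * bytes_per_rank in the receive buffer, so a per-rank size that is not a multiple of 16 bytes leaves later chunks starting at unaligned addresses.

```python
VECTOR_BYTES = 16  # one 128-bit load/store


def is_aligned(address: int, alignment: int = VECTOR_BYTES) -> bool:
    """True if address is a whole multiple of alignment."""
    return address % alignment == 0


def chunk_paths(base_address: int, bytes_per_rank: int, nranks: int):
    """For each rank's chunk in the receive buffer, report the start offset
    modulo 16 and which copy path this illustrative model would pick."""
    paths = []
    for rank in range(nranks):
        start = base_address + rank * bytes_per_rank
        # Assumed rule: the fast path needs an aligned start and an aligned length.
        fast = is_aligned(start) and bytes_per_rank % VECTOR_BYTES == 0
        paths.append((rank, start % VECTOR_BYTES, "vectorized" if fast else "fallback"))
    return paths


base = 0x7F0000000000  # hypothetical 16-byte-aligned device address
print(chunk_paths(base, 8186848, 2))      # per-rank size is a multiple of 16
print(chunk_paths(base, 8186848 + 2, 2))  # 2 extra bytes per rank
```

In this model, a couple of extra bytes per rank is enough to push every chunk onto the fallback path, which is consistent with padding restoring the fast path.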

KinglittleQ commented 2 years ago

@sjeaugey Hi, is there any progress now?

jbachan commented 2 years ago

Unaligned LL128 was improved in 2.10.

NVIDIA / nccl

Regarding the AllGather bandwidth with different byte alignment under different protocols #413