Open nachtsky1077 opened 4 years ago
Yes, this is expected. Both LL128 and Simple are optimized for bandwidth, which means using 128-bit loads and stores, and those don't work well on unaligned buffers. Simple should have better performance on NCCL 2.8, but padding is a solution that would always work well.
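As a rough illustration of the padding suggestion, here is a minimal sketch that rounds a payload size up to the next multiple of 16 bytes (128 bits). The helper name `padded_size` and the constant `ALIGN_BYTES` are hypothetical and not part of NCCL; the sketch only shows the size arithmetic, not the actual collective call.

```python
ALIGN_BYTES = 16  # 128 bits, the access width mentioned above


def padded_size(nbytes: int, align: int = ALIGN_BYTES) -> int:
    """Round nbytes up to the next multiple of align (hypothetical helper)."""
    if nbytes < 0 or align <= 0:
        raise ValueError("nbytes must be >= 0 and align must be > 0")
    return (nbytes + align - 1) // align * align


# Example: an aligned payload plus 4 extra bytes.
payload = 16373696 + 4
padded = padded_size(payload)
print(payload, padded, padded - payload)  # 16373700 16373712 12
```

The caller would allocate and gather `padded` bytes per rank and ignore the trailing padding on the receiving side.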
Thanks for the explanation! I'm also curious about the reason why the unaligned buffers would ruin the performance so badly (around 13x according to the tests). Thanks!
The whole code is optimized for 128-bit operations (which happen to require 128-bit alignment), so when we detect that buffers are not aligned, we need to fall back on much slower code. I agree the fallback code could probably be more optimized (and we did improve it in 2.8 for Simple), but it's not always easy.
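To see why a few extra bytes per rank can leave buffers unaligned, here is a small sketch. It assumes the usual allgather layout, where the data from rank r lands at byte offset r * bytes_per_rank in the receive buffer, and it assumes the base address itself is 16-byte aligned. The function `unaligned_ranks` is a hypothetical helper for illustration, not an NCCL API, and it does not reproduce NCCL's own alignment detection.

```python
ALIGN_BYTES = 16  # 128-bit loads/stores need 16-byte aligned addresses


def unaligned_ranks(bytes_per_rank, nranks, base_addr=0, align=ALIGN_BYTES):
    """Return ranks whose chunk in the receive buffer starts at an
    address that is not a multiple of align (assumed layout: rank r's
    chunk starts at base_addr + r * bytes_per_rank)."""
    return [
        r for r in range(nranks)
        if (base_addr + r * bytes_per_rank) % align != 0
    ]


print(unaligned_ranks(16373696, 4))  # [] : every chunk starts aligned
print(unaligned_ranks(16373700, 4))  # [1, 2, 3] : 4 extra bytes shift later chunks
```

Under these assumptions, only rank 0's chunk stays aligned once the per-rank size stops being a multiple of 16 bytes.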
@sjeaugey Hi, is there any progress now?
Unaligned LL128 was improved in 2.10
I'm running into cases where I have to allgather a few extra bytes (say 4 bytes), which makes the data no longer 32-byte or 64-byte aligned. When this happens, I observe substantial performance degradation. I ran the NCCL tests to validate the observation (NCCL v2.5.7-1 was used):
test command: ./all_gather_perf -b 16373696 -e 16373792 -i 4 -g 2
Case 1: NCCL_PROTO=Simple
Case 2: NCCL_PROTO=LL
Case 3: NCCL_PROTO=LL128
Based on these tests, Simple and LL128 protocols are sensitive to data alignment, while LL protocol is not.
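For reference, the sizes swept by the test command can be enumerated and checked against a 16-byte (128-bit) boundary. This is only a sketch: it assumes `-b`, `-e` and `-i` are the start size, end size (inclusive) and increment in bytes, and it does not model how the test divides the total size across ranks. The helper `sweep_sizes` is hypothetical.

```python
def sweep_sizes(begin: int, end: int, step: int) -> list:
    """Sizes from begin to end (inclusive, by assumption) in increments of step."""
    return list(range(begin, end + 1, step))


sizes = sweep_sizes(16373696, 16373792, 4)
aligned = [s for s in sizes if s % 16 == 0]

print(len(sizes))    # 25 sizes in the sweep
print(len(aligned))  # 7 of them are multiples of 16 bytes
print(aligned[:3])   # [16373696, 16373712, 16373728]
```

Most sizes in the sweep are therefore not multiples of 16 bytes, which fits the observation that the protocols relying on 128-bit accesses are the ones affected.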
My question is:
Thanks!