RulinShao / LightSeq

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Megatron-LM’s communication #7

Closed: wangpengfei1013 closed this issue 4 months ago

wangpengfei1013 commented 5 months ago


Isn't it four all-gathers and four reduce-scatters per layer?

DachengLi1 commented 5 months ago

There is an extra all-gather in the backward pass, where each block re-gathers its sequence-sharded input to compute the weight gradients (×2, since there are two blocks, attention and MLP, per layer). That takes the count from four all-gathers to six.
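
For concreteness, here is a minimal tally of the per-layer collectives this implies, assuming the standard Megatron-LM tensor + sequence parallel layout in which each of the two blocks is bracketed by an all-gather and a reduce-scatter in the forward pass. This is an illustrative sketch, not LightSeq or Megatron-LM code:

```python
# Sketch: counting per-layer collectives under Megatron-LM style
# tensor + sequence parallelism. Names and structure are assumptions
# for illustration, not taken from any repository.

BLOCKS_PER_LAYER = 2  # attention block + MLP block

# Forward: each block all-gathers its sequence-sharded input and
# reduce-scatters its output.
fwd_all_gathers = BLOCKS_PER_LAYER      # 2
fwd_reduce_scatters = BLOCKS_PER_LAYER  # 2

# Backward: autograd mirrors the forward collectives (the backward of
# an all-gather is a reduce-scatter, and vice versa) ...
bwd_reduce_scatters = BLOCKS_PER_LAYER  # 2
bwd_all_gathers = BLOCKS_PER_LAYER      # 2

# ... plus the extra all-gather discussed above: each block must
# re-gather its sharded input to compute weight gradients.
extra_all_gathers = BLOCKS_PER_LAYER    # 2

total_all_gathers = fwd_all_gathers + bwd_all_gathers + extra_all_gathers
total_reduce_scatters = fwd_reduce_scatters + bwd_reduce_scatters

print(total_all_gathers, total_reduce_scatters)  # 6 4, not 4 and 4
```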

wangpengfei1013 commented 5 months ago

> There is an extra all-gather in the backward pass (×2, since there are two blocks per layer).

Thank you for your answer. I think I understand it now.