RulinShao / LightSeq

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Megatron-LM’s communication #7

Closed: wangpengfei1013 closed this issue 4 months ago

wangpengfei1013 commented 5 months ago


Isn't it four all-gathers and four reduce-scatters per layer?

DachengLi1 commented 5 months ago

There is an extra all-gather in the backward pass, where each block re-gathers its sequence-sharded input to compute the weight gradients (×2, since there are two blocks, attention and MLP, per layer). That takes the count from four all-gathers to six.
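
For concreteness, here is a minimal tally of the per-layer collectives this implies, assuming the standard Megatron-LM tensor + sequence parallel layout in which each of the two blocks is bracketed by an all-gather and a reduce-scatter in the forward pass. This is an illustrative sketch, not LightSeq or Megatron-LM code:

```python
# Sketch: counting per-layer collectives under Megatron-LM style
# tensor + sequence parallelism. Names and structure are assumptions
# for illustration, not taken from any repository.

BLOCKS_PER_LAYER = 2  # attention block + MLP block

# Forward: each block all-gathers its sequence-sharded input and
# reduce-scatters its output.
fwd_all_gathers = BLOCKS_PER_LAYER      # 2
fwd_reduce_scatters = BLOCKS_PER_LAYER  # 2

# Backward: autograd mirrors the forward collectives (the backward of
# an all-gather is a reduce-scatter, and vice versa) ...
bwd_reduce_scatters = BLOCKS_PER_LAYER  # 2
bwd_all_gathers = BLOCKS_PER_LAYER      # 2

# ... plus the extra all-gather discussed above: each block must
# re-gather its sharded input to compute weight gradients.
extra_all_gathers = BLOCKS_PER_LAYER    # 2

total_all_gathers = fwd_all_gathers + bwd_all_gathers + extra_all_gathers
total_reduce_scatters = fwd_reduce_scatters + bwd_reduce_scatters

print(total_all_gathers, total_reduce_scatters)  # 6 4, not 4 and 4
```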

wangpengfei1013 commented 5 months ago

> There is an extra all-gather in the backward pass (×2, since there are two blocks per layer).

Thank you for your answer. I think I understand it now.