Thanks for your brilliant work!
I have a small question about the content of the paper: in Table 1, for the sequence length, is the total length for the Megatron-LM method equal to the sum of the lengths on each GPU? My understanding is that within the same tensor-parallel group in Megatron-LM, the input sequence on every GPU should be identical.
If my understanding is incorrect, please kindly point it out.
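To make the distinction in my question concrete, here is a minimal sketch (purely illustrative, not real Megatron-LM code) contrasting how I understand the two layouts: under tensor parallelism each rank in the group holds the same full sequence, whereas under sequence-level splitting the per-GPU lengths would sum to the total.

```python
# Illustrative sketch only -- no real Megatron-LM APIs are used.
full_sequence = list(range(8))  # token ids for one input of length 8
num_gpus = 4

# Tensor parallelism: every rank in the TP group sees the SAME full
# sequence, so per-GPU length == total length (not a fraction of it).
tp_inputs = [full_sequence for _ in range(num_gpus)]
assert all(len(s) == len(full_sequence) for s in tp_inputs)

# Sequence-level splitting: the sequence is sharded along its length,
# so the per-GPU lengths sum to the total length.
chunk = len(full_sequence) // num_gpus
sp_inputs = [full_sequence[i * chunk:(i + 1) * chunk]
             for i in range(num_gpus)]
assert sum(len(s) for s in sp_inputs) == len(full_sequence)
```

If Table 1 reports per-GPU lengths under the second layout, then summing them would give the total; under the first, it would not.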