Thanks for your brilliant work!
I have a small question about the content of the paper: in Table 1, for the sequence length, is the total length for the Megatron-LM method equal to the sum of the lengths on each GPU? My understanding is that within the same tensor-parallel group in Megatron-LM, the input sequence on every GPU should be identical.
If my understanding is incorrect, please kindly point it out.
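To make the distinction in my question concrete, here is a minimal sketch (purely illustrative, not real Megatron-LM code) contrasting how I understand the two layouts: under tensor parallelism each rank in the group holds the same full sequence, whereas under sequence-level splitting the per-GPU lengths would sum to the total.

```python
# Illustrative sketch only -- no real Megatron-LM APIs are used.
full_sequence = list(range(8))  # token ids for one input of length 8
num_gpus = 4

# Tensor parallelism: every rank in the TP group sees the SAME full
# sequence, so per-GPU length == total length (not a fraction of it).
tp_inputs = [full_sequence for _ in range(num_gpus)]
assert all(len(s) == len(full_sequence) for s in tp_inputs)

# Sequence-level splitting: the sequence is sharded along its length,
# so the per-GPU lengths sum to the total length.
chunk = len(full_sequence) // num_gpus
sp_inputs = [full_sequence[i * chunk:(i + 1) * chunk]
             for i in range(num_gpus)]
assert sum(len(s) for s in sp_inputs) == len(full_sequence)
```

If Table 1 reports per-GPU lengths under the second layout, then summing them would give the total; under the first, it would not.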