NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

sequence parallel with rmsnorm/layernorm #1686

Open wlike opened 1 year ago

wlike commented 1 year ago

When sequence parallelism is enabled together with tensor parallelism during training with Megatron, each tensor-parallel rank holds its own copy of the RMSNorm or LayerNorm parameters, and these copies end up with different values.

For example, with a tensor-parallel size of 8 and a hidden dimension of 1024, there are 8 parameter tensors, each of size 1024, and their values differ. With a tensor-parallel size of 4, there are 4 parameter tensors with different values.
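
To make the situation concrete, here is a minimal sketch of how the divergence between the copies could be measured. It assumes each rank's norm weight has already been pulled out of its checkpoint shard and converted to a plain list of floats (for example with `tensor.tolist()`). The function name `max_rank_spread` and the sample values are hypothetical and are not part of apex or Megatron.

```python
from typing import Sequence


def max_rank_spread(copies: Sequence[Sequence[float]]) -> float:
    """Return the largest element-wise gap (max minus min) across per-rank copies."""
    if not copies:
        raise ValueError("need at least one copy")
    width = len(copies[0])
    if any(len(c) != width for c in copies):
        raise ValueError("all copies must have the same length")
    # zip(*copies) groups the values that sit at the same index on every rank.
    return max(max(column) - min(column) for column in zip(*copies))


# Hypothetical stand-in: 4 tensor-parallel ranks, hidden size 3.
# (The example in this issue has 8 ranks and a hidden size of 1024.)
per_rank_weights = [
    [1.00, 1.00, 1.00],
    [1.00, 1.25, 1.00],
    [1.00, 1.00, 0.50],
    [1.00, 1.00, 1.00],
]

spread = max_rank_spread(per_rank_weights)
print(f"copies: {len(per_rank_weights)}, max spread: {spread}")
```

A spread of zero would mean the copies are identical and any one of them could be used as is; a non-zero spread is the case described here.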

In this situation, how can these different parameter tensors be merged into a single tensor for inference? Thank you!