sequence parallel with rmsnorm/layernorm

When sequence parallelism is enabled together with tensor parallelism during training with Megatron, each tensor-parallel rank ends up with its own copy of the RMSNorm or LayerNorm parameters, and the copies have different values.

For example, with a tensor parallel size of 8 and a hidden dimension of 1024, there are 8 parameter tensors, each of dimension 1024, and their values differ. With a tensor parallel size of 4, there are 4 parameter tensors with different values.
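To make the situation concrete, here is a minimal sketch in plain Python. Lists stand in for the per-rank norm weights, and the toy sizes (4 copies, hidden dimension 3) replace the real 8 x 1024 case. The helper names `max_divergence` and `average_copies` are hypothetical and not part of Megatron or apex. Element-wise averaging is only one candidate way to merge the copies; whether it is the correct conversion is exactly what is unclear here.

```python
from statistics import fmean


def max_divergence(copies):
    """Largest absolute element-wise gap between any copy and the first copy."""
    reference = copies[0]
    return max(
        abs(value - ref)
        for copy in copies
        for value, ref in zip(copy, reference)
    )


def average_copies(copies):
    """Element-wise mean over the per-rank copies.

    Assumption: averaging is a candidate merge, not a confirmed method.
    """
    hidden = len(copies[0])
    if any(len(copy) != hidden for copy in copies):
        raise ValueError("all copies must have the same hidden dimension")
    # zip(*copies) groups the i-th element of every copy together.
    return [fmean(column) for column in zip(*copies)]


# Toy stand-in for the norm weight saved by each tensor-parallel rank.
rank_weights = [
    [1.00, 0.50, 2.00],
    [1.02, 0.48, 2.00],
    [0.98, 0.52, 2.00],
    [1.00, 0.50, 2.00],
]

merged = average_copies(rank_weights)
print("max divergence:", max_divergence(rank_weights))
print("merged weight:", merged)
```

Checking the divergence first shows whether the copies are nearly identical or genuinely far apart, which affects how much an averaged tensor can be trusted.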

In this situation, how should these different parameter tensors be merged into a single tensor for inference? Thank you!

NVIDIA / apex

sequence parallel with rmsnorm/layernorm #1686