bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Question about the implementation of mpu.cross_entropy when using tensor parallel #394

Open robin087 opened 1 year ago

robin087 commented 1 year ago

Hello. When using tensor parallelism on BLOOM (tp_size = 8), we find that the cross-entropy loss computed by mpu.cross_entropy differs from torch.nn.functional.cross_entropy by about 1% on our data. Looking at the implementation of mpu.cross_entropy, the loss is computed over partition_vocab_size, which is 8 times smaller than vocab_size (tp_size = 8). We suspect this partitioning is what causes the difference. Is this implementation correct? Can it guarantee the same model quality when training with tensor parallelism?
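For context, a vocab-parallel cross entropy of the kind mpu.cross_entropy implements is, in exact arithmetic, algebraically identical to the full-vocabulary loss: each rank holds a contiguous vocab shard and the global max, target logit, and softmax denominator are recovered via all-reduces. The following is a minimal single-process sketch of that pattern (it simulates the tp_size ranks with array slices and replaces the all-reduce collectives with plain numpy reductions; it is not the actual mpu code):

```python
import numpy as np

def vocab_parallel_cross_entropy(logits, targets, tp_size):
    """Single-process sketch of vocab-parallel cross entropy.

    logits: [N, V] array, targets: [N] int array. The vocab dimension is
    split into tp_size contiguous partitions, mimicking tensor parallelism.
    """
    n, vocab = logits.shape
    part = vocab // tp_size
    shards = [logits[:, r * part:(r + 1) * part] for r in range(tp_size)]

    # 1. all-reduce(MAX) over per-shard maxima -> global max per row
    global_max = np.max([s.max(axis=1) for s in shards], axis=0)

    # 2. each "rank" contributes the target logit only if the target index
    #    falls inside its partition; all-reduce(SUM) recovers it exactly
    predicted = np.zeros(n)
    for r, s in enumerate(shards):
        lo = r * part
        mask = (targets >= lo) & (targets < lo + part)
        idx = np.where(mask)[0]
        predicted[idx] = s[idx, targets[idx] - lo] - global_max[idx]

    # 3. all-reduce(SUM) of per-shard exp-sums -> full softmax denominator
    sum_exp = sum(np.exp(s - global_max[:, None]).sum(axis=1) for s in shards)

    # loss = log(sum_j exp(z_j - max)) - (z_target - max)
    return np.log(sum_exp) - predicted

# Check against a full-vocabulary reference on random data
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16)).astype(np.float64)
targets = rng.integers(0, 16, size=4)

full_max = logits.max(axis=1, keepdims=True)
ref = np.log(np.exp(logits - full_max).sum(axis=1)) - (
    logits[np.arange(4), targets] - full_max[:, 0])

assert np.allclose(vocab_parallel_cross_entropy(logits, targets, 8), ref)
```

In this idealized form the sharded loss matches the full one to floating-point tolerance regardless of tp_size, so a persistent ~1% gap in practice would have to come from something else (e.g. reduced precision, summation order, or differing label/ignore-index handling), not from the partitioning itself.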