bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Question about the implementation of mpu.cross_entropy when using tensor parallel #394

Open robin087 opened 1 year ago

robin087 commented 1 year ago

Hello. When using tensor parallelism on BLOOM (tp_size = 8), we find that the cross-entropy loss computed by mpu.cross_entropy differs from torch.nn.functional.cross_entropy by about 1% on our data. Looking at the implementation of mpu.cross_entropy, the loss is computed over partition_vocab_size, which is 8 times smaller than vocab_size (tp_size = 8). We suspect this partitioning is what causes the difference. Is this implementation correct? Can it guarantee the same model quality when training with tensor parallelism?
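For context, a vocab-parallel cross entropy of the kind mpu.cross_entropy implements is, in exact arithmetic, algebraically identical to the full-vocabulary loss: each rank holds a contiguous vocab shard and the global max, target logit, and softmax denominator are recovered via all-reduces. The following is a minimal single-process sketch of that pattern (it simulates the tp_size ranks with array slices and replaces the all-reduce collectives with plain numpy reductions; it is not the actual mpu code):

```python
import numpy as np

def vocab_parallel_cross_entropy(logits, targets, tp_size):
    """Single-process sketch of vocab-parallel cross entropy.

    logits: [N, V] array, targets: [N] int array. The vocab dimension is
    split into tp_size contiguous partitions, mimicking tensor parallelism.
    """
    n, vocab = logits.shape
    part = vocab // tp_size
    shards = [logits[:, r * part:(r + 1) * part] for r in range(tp_size)]

    # 1. all-reduce(MAX) over per-shard maxima -> global max per row
    global_max = np.max([s.max(axis=1) for s in shards], axis=0)

    # 2. each "rank" contributes the target logit only if the target index
    #    falls inside its partition; all-reduce(SUM) recovers it exactly
    predicted = np.zeros(n)
    for r, s in enumerate(shards):
        lo = r * part
        mask = (targets >= lo) & (targets < lo + part)
        idx = np.where(mask)[0]
        predicted[idx] = s[idx, targets[idx] - lo] - global_max[idx]

    # 3. all-reduce(SUM) of per-shard exp-sums -> full softmax denominator
    sum_exp = sum(np.exp(s - global_max[:, None]).sum(axis=1) for s in shards)

    # loss = log(sum_j exp(z_j - max)) - (z_target - max)
    return np.log(sum_exp) - predicted

# Check against a full-vocabulary reference on random data
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16)).astype(np.float64)
targets = rng.integers(0, 16, size=4)

full_max = logits.max(axis=1, keepdims=True)
ref = np.log(np.exp(logits - full_max).sum(axis=1)) - (
    logits[np.arange(4), targets] - full_max[:, 0])

assert np.allclose(vocab_parallel_cross_entropy(logits, targets, 8), ref)
```

In this idealized form the sharded loss matches the full one to floating-point tolerance regardless of tp_size, so a persistent ~1% gap in practice would have to come from something else (e.g. reduced precision, summation order, or differing label/ignore-index handling), not from the partitioning itself.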