bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Cannot run 3D parallelism with tp == 1 dp == 3 pp == 2 degrees #397

Closed Heelim-Hong closed 12 months ago

Heelim-Hong commented 1 year ago

Hello everyone,

I'm currently trying to run GPT-2 utilizing 3D parallelism. My configuration for the degrees of parallelism is as follows:

- Tensor parallelism (tp) = 1
- Data parallelism (dp) = 3
- Pipeline parallelism (pp) = 2

I'm operating on 2 nodes (one with two V100 GPUs and the other with four 2080Ti GPUs). Considering the GPU resources at my disposal, I believe I should be able to run this setup without any issues.
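As a quick sanity check on the arithmetic (assuming, as I understand it, that Megatron-DeepSpeed derives the data-parallel degree from the world size as dp = world_size / (tp * pp)):

```python
# Sanity check: 6 GPUs across the two nodes should exactly cover tp=1, dp=3, pp=2.
world_size = 2 + 4   # two V100s on node1 + four 2080Tis on node2
tp, pp = 1, 2        # tensor- and pipeline-parallel degrees
assert world_size % (tp * pp) == 0
dp = world_size // (tp * pp)
print(dp)            # -> 3, the intended data-parallel degree
```

So the world size itself should be fine for these degrees.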

global rank mapping = {'node1': [0, 1], 'node2': [2, 3, 4, 5]}
MPU DP: [0, 1, 2]
MPU DP: [3, 4, 5]
MPU PP: [0, 3]
MPU PP: [1, 4]
MPU PP: [2, 5]
MPU IO: [0, 1, 2, 3, 4, 5]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
MPU MP: [4]
MPU MP: [5]
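For what it's worth, these groups match the grid layout I would expect: ranks on the same pipeline stage form a data-parallel group, and ranks sharing a data-parallel index form a pipeline group. A small reconstruction in plain Python (my own sketch, not the actual DeepSpeed topology code):

```python
# Reconstruct the expected DP/PP rank groups for tp=1, dp=3, pp=2 over ranks 0..5.
tp, dp, pp = 1, 3, 2

# With tp=1, a common layout is: rank = pipeline_stage * dp + dp_index
dp_groups = [[stage * dp + i for i in range(dp)] for stage in range(pp)]
pp_groups = [[stage * dp + i for stage in range(pp)] for i in range(dp)]

print("DP groups:", dp_groups)  # [[0, 1, 2], [3, 4, 5]] -- matches the MPU DP printout
print("PP groups:", pp_groups)  # [[0, 3], [1, 4], [2, 5]] -- matches the MPU PP printout
```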

However, I noticed that all GPUs within a data parallel group hold identical model parameters. I'm wondering whether the data-parallel degree being 3 (an odd number rather than a power of 2) might be the underlying issue.

Has anyone encountered similar challenges, or does anyone have insight into whether an odd data-parallel degree could cause problems?
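In case it helps with debugging, this is the kind of check I have in mind for comparing the replicas inside a data-parallel group: gather a cheap checksum of the parameters from each rank in the group and compare them. A rough sketch with torch.distributed (the `model` and `dp_group` variables are placeholders, not taken from the actual run):

```python
import torch
import torch.distributed as dist

def dp_param_checksum(model, dp_group):
    """Gather a scalar fingerprint of the model parameters from every rank in dp_group."""
    # Summing all parameter values is a crude but cheap fingerprint of the weights.
    local = sum(p.detach().float().sum() for p in model.parameters()).reshape(1)
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size(group=dp_group))]
    dist.all_gather(gathered, local, group=dp_group)
    return [t.item() for t in gathered]

# Compare the returned values across ranks to see whether the replicas agree.
```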

I think the error comes from:

File "python3.7/site-packages/torch/distributed/distributed_c10d.py", line 956, in all_reduce
    work = group.allreduce([tensor], opts)
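For reference, a minimal standalone all_reduce over an explicit subgroup looks like the snippet below (my own sketch, not the failing code path); as far as I understand, collectives like this typically hang or fail when the participating ranks disagree on group membership or on the order in which the groups are created:

```python
# Hypothetical launch: torchrun --nproc_per_node=6 allreduce_check.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # gloo so the sketch also runs on CPU
rank = dist.get_rank()

# Every rank must create the same groups in the same order, even ones it doesn't belong to.
groups = [dist.new_group(ranks=[0, 1, 2]), dist.new_group(ranks=[3, 4, 5])]
my_group = groups[0] if rank < 3 else groups[1]

t = torch.tensor([float(rank)])
dist.all_reduce(t, group=my_group)   # sums within the 3-rank data-parallel group
print(f"rank {rank}: {t.item()}")    # 3.0 for ranks 0-2, 12.0 for ranks 3-5

dist.destroy_process_group()
```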

Thanks in advance for your help!