NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Distributed training all-reduce order #107

Open · zhiqi-0 opened this issue 3 years ago

zhiqi-0 commented 3 years ago

Hi,

I'm wondering whether there is a potential issue with all-reduce ordering when both data parallelism and tensor model parallelism are enabled during training. With torch DDP, tensor model parallelism and data parallelism each use all-reduce, and their collectives are launched on different streams. Since the execution order is then left to the hardware, could this cause a hang in a case like the following:

GPU1: [MP] all-reduce -> [DP] all-reduce
GPU2: [DP] all-reduce -> [MP] all-reduce

Based on issues discussed here, I think an undetermined ordering may be unsafe.
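
To make the scenario concrete, here is a minimal standalone sketch (my own illustration, not Megatron-LM code; the group setup, tensor names, and two-GPU world size are made up) of the pattern I am worried about: two ranks enqueue all-reduces on two overlapping process groups in opposite orders.

```python
# Hypothetical sketch (not Megatron-LM code): two ranks issue all-reduces on two
# overlapping process groups in opposite orders. NCCL expects every rank to enqueue
# collectives on its communicators in the same order, so this pattern can deadlock
# depending on how the GPU schedules the two NCCL kernels.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int = 2) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-ins for the model-parallel and data-parallel groups.
    # new_group() is called in the same order on every rank, as required.
    mp_group = dist.new_group(ranks=list(range(world_size)))
    dp_group = dist.new_group(ranks=list(range(world_size)))

    grad_mp = torch.ones(1, device="cuda")
    grad_dp = torch.ones(1, device="cuda")

    if rank == 0:
        dist.all_reduce(grad_mp, group=mp_group)  # rank 0: [MP] first ...
        dist.all_reduce(grad_dp, group=dp_group)  # ... then [DP]
    else:
        dist.all_reduce(grad_dp, group=dp_group)  # rank 1: [DP] first ...
        dist.all_reduce(grad_mp, group=mp_group)  # ... then [MP] -> possible hang

    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, nprocs=2)
```

Whether this actually hangs in practice depends on the NCCL version and on how the two NCCL kernels are scheduled on the GPU, but as far as I understand, the mismatched enqueue order across ranks is the unsafe part.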

github-actions[bot] commented 11 months ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 9 months ago

Marking as stale. No activity in 60 days.

fwyc0573 commented 5 months ago

I have encountered the same problem. Can anyone share any insight?

deepakn94 commented 5 months ago

@fwyc0573 are you seeing a hang? Can you describe the setting, perhaps provide an example command line, and also paste the last couple of lines in the logs?

github-actions[bot] commented 3 months ago

Marking as stale. No activity in 60 days.