NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] `finish_embedding_wgrad_compute` appears after grad all-reduce #1012

Open QPHutu opened 3 weeks ago

QPHutu commented 3 weeks ago

Describe the bug

In `megatron/core/pipeline_parallel/schedules.py`, shouldn't `finish_embedding_wgrad_compute` be called before `enable_grad_sync` and `grad_sync_func`? As written, it appears to run after the gradient all-reduce.


Expected behavior

Gradient all-reduce should happen after gradient computations.
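To make the ordering concern concrete, here is a minimal toy sketch in plain Python. It does not use any Megatron-LM code: `ToyRank`, `finish_deferred_wgrad`, `all_reduce_grads`, and `run` are hypothetical stand-ins, and the numbers are made up. It only shows why an all-reduce that runs before a deferred weight-gradient computation could leave ranks with unsynchronized gradients; whether this actually happens in `schedules.py` depends on how the grad-sync hooks are wired.

```python
# Toy model of the ordering concern; none of these names are Megatron-LM APIs.

class ToyRank:
    """One data-parallel rank with a regular and a deferred (embedding) gradient."""

    def __init__(self, layer_grad, embedding_grad):
        self.grads = {"layer": layer_grad, "embedding": 0.0}
        # Embedding weight gradient that has not been accumulated yet.
        self._pending_embedding_grad = embedding_grad

    def finish_deferred_wgrad(self):
        # Hypothetical stand-in for finishing the deferred embedding wgrad compute.
        self.grads["embedding"] += self._pending_embedding_grad
        self._pending_embedding_grad = 0.0


def all_reduce_grads(ranks):
    # Hypothetical stand-in for a gradient all-reduce (mean across ranks).
    for name in ranks[0].grads:
        mean = sum(r.grads[name] for r in ranks) / len(ranks)
        for r in ranks:
            r.grads[name] = mean


def run(sync_first):
    ranks = [ToyRank(1.0, 2.0), ToyRank(3.0, 4.0)]
    if sync_first:
        # Order questioned in this issue: sync, then finish the deferred compute.
        all_reduce_grads(ranks)
        for r in ranks:
            r.finish_deferred_wgrad()
    else:
        # Expected order: finish all gradient computation, then sync.
        for r in ranks:
            r.finish_deferred_wgrad()
        all_reduce_grads(ranks)
    return [r.grads["embedding"] for r in ranks]


synced_too_early = run(sync_first=True)       # ranks disagree: [2.0, 4.0]
synced_after_compute = run(sync_first=False)  # ranks agree:    [3.0, 3.0]
```

In this toy setting, syncing first leaves each rank holding only its local embedding gradient, while finishing the deferred computation first gives every rank the same averaged value.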

elliottnv commented 3 weeks ago

@sanandaraj5597 Can you share some comments? Thank you!