[BUG] `finish_embedding_wgrad_compute` appears after grad all-reduce

NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

Open QPHutu opened 3 weeks ago

QPHutu commented 3 weeks ago

Describe the bug

In `megatron/core/pipeline_parallel/schedules.py`, `finish_embedding_wgrad_compute` appears after `enable_grad_sync` and `grad_sync_func`. Shouldn't it be called before them?

Expected behavior

Gradient all-reduce should happen after all gradient computations have finished.
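
To illustrate the concern, here is a minimal toy sketch in plain Python, not Megatron-LM code. It models two data-parallel ranks whose gradient has a deferred contribution (standing in for the embedding weight-gradient computation) and compares the two orderings. All names (`all_reduce_mean`, `run_step`, `finish_before_sync`) are hypothetical, and whether the real schedule behaves this way depends on how the deferred computation interacts with the gradient buffers.

```python
# Toy model of two data-parallel ranks. Names are hypothetical, not Megatron-LM APIs.

def all_reduce_mean(grads):
    """Average one gradient value across ranks (stand-in for a grad all-reduce)."""
    mean = sum(grads) / len(grads)
    return [mean] * len(grads)


def run_step(main_grads, deferred_wgrads, finish_before_sync):
    """Return the per-rank gradient after one step.

    main_grads: gradient already accumulated on each rank.
    deferred_wgrads: weight-gradient contribution still pending on each rank.
    finish_before_sync: if True, finish the deferred compute before the all-reduce.
    """
    grads = list(main_grads)
    if finish_before_sync:
        # Expected order: finish all gradient computation, then synchronize.
        grads = [g + d for g, d in zip(grads, deferred_wgrads)]
        grads = all_reduce_mean(grads)
    else:
        # Suspect order: synchronize first, then add the late contribution locally.
        grads = all_reduce_mean(grads)
        grads = [g + d for g, d in zip(grads, deferred_wgrads)]
    return grads


main = [1.0, 3.0]
deferred = [2.0, 6.0]
correct = run_step(main, deferred, finish_before_sync=True)
suspect = run_step(main, deferred, finish_before_sync=False)
print(correct)  # [6.0, 6.0]: every rank holds the same averaged gradient
print(suspect)  # [4.0, 8.0]: the late contribution was never reduced
```

In this toy model, the late contribution is left out of the reduction when the all-reduce runs first, so the ranks end up with different gradients.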

elliottnv commented 3 weeks ago

@sanandaraj5597 Can you share some comments? Thank you!