NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Is it expected to do grad norm on dense-optimizer and moe-optimizer respectively? #785

Open ezioliao opened 3 months ago

ezioliao commented 3 months ago

If we enable expert parallelism, there are two optimizers: one for the dense parameters and one for the expert parameters. When we call optimizer.step(), each optimizer computes the grad norm over its own parameters only.

But if we do not enable expert parallelism, the grad norm is computed over all model parameters together.

So my question is: the grad-norm behavior differs mathematically depending on whether expert parallelism is turned on. Is this expected?
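For illustration, here is a minimal sketch of the mathematical difference (not Megatron-LM's actual code; the dense/expert split and tensor shapes are made up, and PyTorch's clip_grad_norm_ stands in for the optimizer's clipping step):

```python
import torch

torch.manual_seed(0)


def make_params():
    # Two parameter groups standing in for dense and expert weights (hypothetical split).
    dense = [torch.nn.Parameter(torch.randn(4))]
    expert = [torch.nn.Parameter(torch.randn(4))]
    for p in dense + expert:
        p.grad = torch.randn_like(p) * 5.0  # large grads so clipping actually triggers
    return dense, expert


max_norm = 1.0

# Case 1: one optimizer, one global norm over all parameters (no expert parallelism).
dense, expert = make_params()
total_norm = torch.nn.utils.clip_grad_norm_(dense + expert, max_norm)
# Every grad is scaled by the same factor, max_norm / total_norm.

# Case 2: two optimizers, each clipping only its own parameters (expert parallelism on).
dense, expert = make_params()
dense_norm = torch.nn.utils.clip_grad_norm_(dense, max_norm)
expert_norm = torch.nn.utils.clip_grad_norm_(expert, max_norm)
# Dense grads are scaled by max_norm / dense_norm and expert grads by
# max_norm / expert_norm: two different scale factors, so the clipped
# gradients differ from the single global clip above.

print(total_norm, dense_norm, expert_norm)
```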

github-actions[bot] commented 1 month ago

Marking as stale. No activity in 60 days.

deepakn94 commented 1 month ago

I believe this behavior is fixed now.