NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] --overlap-grad-allreduce failing as gradients coming through as None in param hook #879

Closed jambo6 closed 1 week ago

jambo6 commented 1 week ago

When I set --overlap-grad-allreduce, my run fails because gradients come through as None inside the param hook. It then trips this assertion:

                if self.ddp_config.overlap_grad_reduce:
                    assert (
                        param.grad is not None
                    ), 'param.grad being None is not safe when overlap_grad_reduce is True'

Gradients are available in the optimizer step, so it's not that I'm simply not computing gradients.

When I disable overlap, I also find that every gradient is None inside the backward hook.

jambo6 commented 1 week ago

Issue on my end

deepakn94 commented 1 week ago

What was the issue? Might be useful for other users.