NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Effect of sequence parallel with dropout rng context #1256

Closed · sbmaruf closed 3 weeks ago

sbmaruf commented 1 month ago

Looking back at the recent release of mcore 0.9.0:

> Known Issue: When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate rng context.

Do you know in which version this regression was introduced? What are its effects during training?

@ko3n1g
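
For background on what the "appropriate rng context" means here: with sequence parallelism, each tensor-parallel rank holds a different slice of the sequence, so dropout should sample an independent mask on each rank. Megatron-Core arranges this by running dropout under its CUDA rng state tracker (`get_cuda_rng_tracker().fork()`). The snippet below is a rough plain-PyTorch analogy, not Megatron's actual tracker, and the per-rank seed offset is hypothetical; it only illustrates per-rank rng forking:

```python
import torch
import torch.nn.functional as F

# Two "ranks" each fork the global rng state and reseed with a
# rank-specific offset, so their dropout masks differ -- the
# behaviour the tensor-parallel rng context is meant to provide
# for sequence-parallel activations.
x = torch.ones(8)
for rank in range(2):
    with torch.random.fork_rng():
        torch.manual_seed(1234 + rank)  # hypothetical per-rank seed offset
        print(f"rank {rank}: {F.dropout(x, p=0.5, training=True).tolist()}")
```

If dropout instead runs outside that context, all ranks reuse the same rng stream, and the masks are no longer coordinated the way sequence parallelism expects.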

ko3n1g commented 1 month ago

Hey @sbmaruf, this issue is due to this line using `and` instead of `,` to separate the rng and fp8 contexts. It's been like this since the introduction of the fp8 context in June 2023. Since we don't use dropout in many of the models we train internally, we haven't studied the impact this has on training.
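
For illustration, here is a minimal, self-contained sketch of why `and` breaks this (the `ctx` helper is hypothetical, standing in for the rng and fp8 context managers): in a `with` statement, `a and b` is a boolean expression that evaluates to just `b`, so only one context is entered, whereas `a, b` enters both.

```python
from contextlib import contextmanager

@contextmanager
def ctx(name):
    # Hypothetical stand-in for the rng / fp8 context managers
    # used in the transformer block's forward pass.
    print(f"enter {name}")
    yield
    print(f"exit {name}")

# Buggy form: both manager objects are created, but the `and`
# expression evaluates to the second one, so only the fp8 context
# is entered and the rng context is silently skipped.
with ctx("rng") and ctx("fp8"):
    pass
# prints: enter fp8 / exit fp8

# Correct form: the comma enters both contexts, rng first.
with ctx("rng"), ctx("fp8"):
    pass
# prints: enter rng / enter fp8 / exit fp8 / exit rng
```

In other words, with the buggy form the rng context manager object is created but its `__enter__` is never called, so dropout runs outside the tensor-parallel rng fork.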

sbmaruf commented 3 weeks ago

Thanks for the reply.