NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

OPTIM: get_batch broadcast traffic when context parallelism is enabled #885

Open Superkeyv opened 1 week ago

Superkeyv commented 1 week ago

We can split the batch along the sequence-length dimension before the broadcast in tp_group. Each context-parallel rank only needs its own sequence shard, so broadcasting the full sequence to every tensor-parallel rank and slicing afterwards wastes bandwidth; slicing first cuts the broadcast volume by a factor of cp_size and saves time in get_batch.
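
A minimal sketch of the idea, assuming a standard PyTorch distributed setup. The helper name `broadcast_cp_shard` and the parameters `tp_group`, `tp_src_rank`, `cp_rank`, and `cp_size` are illustrative, not Megatron-LM APIs; only the `torch.distributed` calls are real. It broadcasts a pre-sliced shard instead of the full `[batch, seq_len]` tensor:

```python
# Sketch only: names below are illustrative, not actual Megatron-LM APIs.
import torch
import torch.distributed as dist

def broadcast_cp_shard(tokens, tp_group, tp_src_rank, cp_rank, cp_size,
                       batch_size, seq_len, device):
    """Broadcast only this CP rank's sequence shard within the TP group.

    tokens: full [batch_size, seq_len] batch, valid on `tp_src_rank` only.
    Returns a [batch_size, seq_len // cp_size] shard on every TP rank.
    """
    shard_len = seq_len // cp_size
    if dist.get_rank() == tp_src_rank:
        # Slice this CP rank's shard BEFORE the broadcast, so the collective
        # moves seq_len / cp_size tokens instead of the full sequence.
        shard = tokens[:, cp_rank * shard_len:(cp_rank + 1) * shard_len].contiguous()
    else:
        # Non-source TP ranks receive into a preallocated buffer.
        shard = torch.empty(batch_size, shard_len, dtype=torch.long, device=device)
    dist.broadcast(shard, src=tp_src_rank, group=tp_group)
    return shard
```

One caveat: the contiguous slice above is a simplification. Megatron-LM's current CP slicing (`get_batch_on_this_cp_rank`) gives each CP rank two non-adjacent chunks (chunk `cp_rank` and chunk `2 * cp_size - 1 - cp_rank`) to balance causal-attention work across ranks, so in practice the source rank would gather those two chunks before the broadcast; the traffic saving is the same.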