NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[ENHANCEMENT] Enhance data efficiency with efficient sequence packing #478

Closed Barber0 closed 2 weeks ago

Barber0 commented 1 year ago

As far as I know, Megatron-LM requires the input sequence length to be fixed and padded to --seq-length. However, for SFT datasets such as tatsu-lab/alpaca or Open-Orca/OpenOrca, the encoded sequences rarely come close to 2048 tokens, so most of each padded batch is wasted compute.
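As a rough illustration (not Megatron code; the sample lengths below are made up), the utilization of a padded batch can be estimated like this:

```python
# Estimate how much of a fixed-length padded batch is real tokens vs. padding.
seq_length = 2048

# Hypothetical encoded lengths of a few short SFT samples.
sample_lengths = [180, 420, 95, 310, 640, 150]

real_tokens = sum(sample_lengths)
padded_tokens = seq_length * len(sample_lengths)
print(f"utilization: {real_tokens / padded_tokens:.1%}")  # ~14.6% for this example
```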

Does the Megatron team plan to add an efficient sequence packing method, such as the one described in this paper: Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance?

Perhaps the technique above could be added to the preprocess_data.py script.
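For concreteness, a minimal packing sketch might look like the following. This is my own greedy next-fit approximation, not the paper's exact algorithm and not an existing Megatron API; the function name and return format are just placeholders:

```python
from typing import List, Tuple

def pack_sequences(samples: List[List[int]], seq_length: int) -> List[Tuple[List[int], List[int]]]:
    """Pack tokenized samples into bins of at most seq_length tokens.

    Returns (packed_tokens, boundaries) pairs, where boundaries are the
    cumulative sequence lengths within each pack, so downstream code can
    reset position ids and build a block-diagonal attention mask at each
    boundary to avoid cross-contamination between packed samples.
    """
    packs: List[Tuple[List[int], List[int]]] = []
    cur_tokens: List[int] = []
    cur_bounds: List[int] = [0]
    # Greedy next-fit over length-sorted samples keeps packs reasonably tight.
    # Samples longer than seq_length would need truncation or splitting,
    # which is not handled here.
    for sample in sorted(samples, key=len, reverse=True):
        if cur_tokens and len(cur_tokens) + len(sample) > seq_length:
            packs.append((cur_tokens, cur_bounds))
            cur_tokens, cur_bounds = [], [0]
        cur_tokens.extend(sample)
        cur_bounds.append(len(cur_tokens))
    if cur_tokens:
        packs.append((cur_tokens, cur_bounds))
    return packs
```

The per-pack boundaries are what make the difference between naive concatenation and the "without cross-contamination" variant: attention and position ids are restarted at each boundary instead of letting samples attend to each other.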

github-actions[bot] commented 10 months ago

Marking as stale. No activity in 60 days.

Leo-T-Zang commented 2 months ago

Wondering the same thing.