As far as I know, Megatron-LM requires the input sequence length to be fixed and padded to `--seq-length`. However, for some SFT datasets like tatsu-lab/alpaca or Open-Orca/OpenOrca, whose encoded sequences rarely reach 2048 tokens, this wastes computing resources.

Does the Megatron team have plans to support an efficient sequence packing method, like the one in this paper: Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance?
Maybe the technique mentioned above could be added to the preprocess_data.py script.
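For illustration, here is a minimal sketch of the kind of first-fit-decreasing packing step that such a preprocessing script could run. All names here are hypothetical and not part of Megatron-LM's API; it assumes `sequences` is a list of token-id lists and `max_seq_len` corresponds to `--seq-length`:

```python
# Hypothetical sketch of first-fit-decreasing sequence packing; not
# Megatron-LM code, just an illustration of the idea from the paper.
from typing import List


def pack_sequences(sequences: List[List[int]], max_seq_len: int) -> List[List[List[int]]]:
    """Greedily pack variable-length sequences into bins whose total
    token count is at most max_seq_len (first-fit decreasing)."""
    bins: List[List[List[int]]] = []  # each bin holds several sequences
    bin_lengths: List[int] = []       # running token count per bin
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_seq_len:
            seq = seq[:max_seq_len]   # truncate over-long sequences
        for i, used in enumerate(bin_lengths):
            if used + len(seq) <= max_seq_len:
                bins[i].append(seq)
                bin_lengths[i] += len(seq)
                break
        else:
            bins.append([seq])
            bin_lengths.append(len(seq))
    return bins


if __name__ == "__main__":
    # Toy example: four short "documents" packed into 8-token bins.
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
    for packed in pack_sequences(docs, max_seq_len=8):
        print(packed, "total tokens:", sum(len(s) for s in packed))
```

Note that packing alone is not enough to avoid cross-contamination: as the paper describes, each packed sample also needs a block-diagonal attention mask and per-sequence position ids so that tokens from one document cannot attend to another.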