As far as I know, Megatron-LM requires the input sequence length to be fixed and padded to `--seq-length`. However, for some SFT datasets like tatsu-lab/alpaca or Open-Orca/OpenOrca, whose encoded sequences rarely reach 2048 tokens, this wastes computing resources.

Does the Megatron team have plans to support an efficient sequence packing method, like the one in this paper: Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance?
Maybe the technique mentioned above could be added to the preprocess_data.py script.
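For illustration, here is a minimal sketch of the kind of first-fit-decreasing packing step that such a preprocessing script could run. All names here are hypothetical and not part of Megatron-LM's API; it assumes `sequences` is a list of token-id lists and `max_seq_len` corresponds to `--seq-length`:

```python
# Hypothetical sketch of first-fit-decreasing sequence packing; not
# Megatron-LM code, just an illustration of the idea from the paper.
from typing import List


def pack_sequences(sequences: List[List[int]], max_seq_len: int) -> List[List[List[int]]]:
    """Greedily pack variable-length sequences into bins whose total
    token count is at most max_seq_len (first-fit decreasing)."""
    bins: List[List[List[int]]] = []  # each bin holds several sequences
    bin_lengths: List[int] = []       # running token count per bin
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_seq_len:
            seq = seq[:max_seq_len]   # truncate over-long sequences
        for i, used in enumerate(bin_lengths):
            if used + len(seq) <= max_seq_len:
                bins[i].append(seq)
                bin_lengths[i] += len(seq)
                break
        else:
            bins.append([seq])
            bin_lengths.append(len(seq))
    return bins


if __name__ == "__main__":
    # Toy example: four short "documents" packed into 8-token bins.
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
    for packed in pack_sequences(docs, max_seq_len=8):
        print(packed, "total tokens:", sum(len(s) for s in packed))
```

Note that packing alone is not enough to avoid cross-contamination: as the paper describes, each packed sample also needs a block-diagonal attention mask and per-sequence position ids so that tokens from one document cannot attend to another.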