The pretokenizer utility (pretokenizer/pretokenize.py) tokenizes datamixes in advance for use with the epfLLM/Megatron-LLM trainer.
The datamix configuration is defined in a YAML file, similar to the classic training configurations of trainer_sft.py. Datasets are loaded with the functions from model_training, so the model_training module must be installed.
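
To illustrate what tokenizing a datamix in advance means, here is a minimal, self-contained sketch. It is not the code in pretokenizer/pretokenize.py: the `DATAMIX` dictionary, the whitespace tokenizer, and the function names `build_vocab` and `pretokenize` are all hypothetical stand-ins. The real utility reads its configuration from YAML and loads datasets through model_training.

```python
from typing import Dict, List

# Hypothetical datamix: dataset name -> raw text samples.
# Illustrative only; the real utility builds this from a YAML configuration.
DATAMIX: Dict[str, List[str]] = {
    "dataset_a": ["hello world", "hello again"],
    "dataset_b": ["world of tokens"],
}


def build_vocab(datamix: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign an integer id to each whitespace-separated word, in first-seen order."""
    vocab: Dict[str, int] = {}
    for samples in datamix.values():
        for text in samples:
            for word in text.split():
                vocab.setdefault(word, len(vocab))
    return vocab


def pretokenize(
    datamix: Dict[str, List[str]], vocab: Dict[str, int]
) -> Dict[str, List[List[int]]]:
    """Convert every sample to a list of token ids once, ahead of training."""
    return {
        name: [[vocab[word] for word in text.split()] for text in samples]
        for name, samples in datamix.items()
    }


vocab = build_vocab(DATAMIX)
tokenized = pretokenize(DATAMIX, vocab)
```

The point of the sketch is the division of work: tokenization runs once over the whole datamix, and a trainer can then consume the stored token ids without invoking a tokenizer during training.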