Add pretokenizer utility

LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

https://open-assistant.io

Apache License 2.0

Add pretokenizer utility #3654

Closed andreaskoepf closed 1 year ago

andreaskoepf commented 1 year ago

The pretokenizer utility (pretokenizer/pretokenize.py) tokenizes datamixes in advance so they can be used with the epfLLM/Megatron-LLM trainer.

The datamix configuration is defined in a YAML file, similar to the classic training configurations of trainer_sft.py. Datasets are loaded with the functions from model_training, so the model_training module must be installed.
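
To illustrate the general idea of tokenizing a datamix ahead of time, here is a minimal, self-contained sketch. It is not the code in pretokenizer/pretokenize.py: the config keys (`datasets`, `name`, `samples`), the whitespace tokenizer, the JSONL output format, and all function names are assumptions made up for this example. The real utility reads its config from YAML and loads datasets through model_training.

```python
import json

# Hypothetical datamix config. In practice this would be parsed from a YAML file;
# the keys below are illustrative and NOT the schema used by pretokenize.py.
DATAMIX = {
    "datasets": [
        {"name": "toy_a", "samples": ["hello world", "hello there"]},
        {"name": "toy_b", "samples": ["world peace"]},
    ],
}


def build_vocab(config):
    """Assign an integer id to each whitespace-separated token, in first-seen order."""
    vocab = {}
    for dataset in config["datasets"]:
        for text in dataset["samples"]:
            for token in text.split():
                vocab.setdefault(token, len(vocab))
    return vocab


def pretokenize(config, vocab, out_path):
    """Tokenize every sample once and write one JSON record per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for dataset in config["datasets"]:
            for text in dataset["samples"]:
                ids = [vocab[token] for token in text.split()]
                record = {"source": dataset["name"], "input_ids": ids}
                f.write(json.dumps(record) + "\n")
                count += 1
    return count


def load_pretokenized(path):
    """Read the records back; a trainer would consume these without re-tokenizing."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

The point of the design is that tokenization cost is paid once, up front, and the training run only reads integer ids from disk. A real tokenizer and a binary output format would replace the toy pieces here.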