LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.85k stars 3.22k forks

Add pretokenizer utility #3654

Closed andreaskoepf closed 11 months ago

andreaskoepf commented 11 months ago

The pretokenizer utility (pretokenizer/pretokenize.py) tokenizes datamixes in advance for use with the epfLLM/Megatron-LLM trainer.

The datamix configuration is defined in a YAML file, similar to the classic training configurations of trainer_sft.py. The datasets are loaded with the functions from model_training, so the model_training module must be installed.
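
To illustrate the idea of tokenizing a datamix ahead of training, here is a minimal sketch. It is not the code from pretokenizer/pretokenize.py: the config keys, the dataset names, the `normalize_datasets` and `pretokenize` helpers, and the whitespace tokenizer are all hypothetical stand-ins, and the config is written as the Python dict that a YAML file might parse into.

```python
from typing import Any

# Hypothetical datamix, shown as the dict a YAML file might parse into.
# The key names and dataset names are illustrative assumptions only.
CONFIG: dict[str, Any] = {
    "datasets": [
        "dataset_a",                        # plain name, no options
        {"dataset_b": {"val_split": 0.1}},  # name with per-dataset options
    ]
}

# Toy in-memory stand-ins for datasets that a real loader would provide.
TOY_DATA = {
    "dataset_a": ["hello world", "hello again"],
    "dataset_b": ["pretokenize once"],
}


def normalize_datasets(config: dict[str, Any]) -> list[tuple[str, dict]]:
    """Turn mixed string / single-key-mapping entries into (name, options) pairs."""
    entries = []
    for item in config["datasets"]:
        if isinstance(item, str):
            entries.append((item, {}))
        else:
            # A mapping entry is expected to hold exactly one dataset name.
            ((name, options),) = item.items()
            entries.append((name, dict(options or {})))
    return entries


def pretokenize(config: dict[str, Any], data: dict[str, list[str]]):
    """Tokenize every sample once, so a trainer could read token ids directly."""
    vocab: dict[str, int] = {}
    token_ids = []
    for name, _options in normalize_datasets(config):
        for text in data[name]:
            # Toy whitespace tokenizer: assign ids in order of first appearance.
            ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
            token_ids.append(ids)
    return token_ids, vocab


ids, vocab = pretokenize(CONFIG, TOY_DATA)
print(ids)  # [[0, 1], [0, 2], [3, 4]]
```

A real run would swap the toy data for datasets loaded through model_training and the whitespace split for the model's tokenizer; the point of the sketch is only that tokenization happens once, before training starts.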