OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0

Add model styles w/out BOS #70

Open chimezie opened 7 months ago

chimezie commented 7 months ago

When doing low-level finetuning (without the aid of HF's SFTTrainer, for example), you may need to tokenize a string with the model's prompting format but without special tokens (BOS and EOS). Mistral is the only model format in model_style.py with a distinct BOS character, separate from other delimiters (such as <|im_start|> for chatml, which is not really a BOS but a structured delimiter). So, I added a template without the <s> BOS character. In any case, the BOS character would usually be added by the tokenizer before the prompt is fed to the model, so this would happen downstream from OgbujiPT.
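To illustrate why a BOS-free template matters, here is a minimal sketch. The helper names (mistral_prompt, toy_tokenizer_prepare) are hypothetical and are not OgbujiPT's API, and the "tokenizer" is only a stand-in that prepends the BOS text the way a real tokenizer typically would when it adds special tokens. It assumes Mistral's [INST] ... [/INST] instruction delimiters and <s> as the BOS text.

```python
BOS = '<s>'  # BOS token text, as used by Mistral-style tokenizers


def mistral_prompt(instruction: str, include_bos: bool = False) -> str:
    """Wrap an instruction in Mistral-style [INST] delimiters.

    Hypothetical helper for illustration; not part of OgbujiPT.
    """
    body = f'[INST] {instruction} [/INST]'
    return f'{BOS}{body}' if include_bos else body


def toy_tokenizer_prepare(text: str, add_special_tokens: bool = True) -> str:
    """Stand-in for a tokenizer that prepends BOS. NOT a real tokenizer."""
    return f'{BOS}{text}' if add_special_tokens else text


# Template already contains BOS, and the tokenizer adds another one
with_bos = toy_tokenizer_prepare(mistral_prompt('Say hello', include_bos=True))
# BOS-free template: the tokenizer contributes the only BOS
without_bos = toy_tokenizer_prepare(mistral_prompt('Say hello'))

print(with_bos)
print(without_bos)
```

With the BOS baked into the template, the sequence ends up with two BOS markers once the tokenizer adds its own; with the BOS-free template there is exactly one.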