When doing low-level finetuning (without the aid of HF's SFTTrainer, for example), you may need to tokenize a string in the model's prompting format but without special tokens (BOS and EOS). Mistral is the only model format in model_style.py with a distinct BOS token separate from its other delimiters (unlike, say, <|im_start|> in ChatML, which is a structured delimiter rather than a true BOS). So, I added a template without the BOS token. In any case, BOS tokens would usually be added by the tokenizer before the prompt is fed to the model, so this would happen downstream from OgbujiPT.
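To illustrate the idea, here is a minimal sketch of a Mistral-style template with and without the BOS token. The `[INST]`/`[/INST]` delimiters and `<s>` BOS follow the published Mistral format, but the function and constant names here are illustrative, not OgbujiPT's actual API:

```python
# Illustrative sketch only; names below are not OgbujiPT's API.
MISTRAL_BOS = '<s>'

def mistral_prompt(user_msg: str, include_bos: bool = True) -> str:
    '''Wrap a user message in Mistral [INST] delimiters.

    Pass include_bos=False for finetuning pipelines where the
    tokenizer itself will prepend BOS (e.g. when tokenizing with
    add_special_tokens=True in HF tokenizers).
    '''
    body = f'[INST] {user_msg} [/INST]'
    return (MISTRAL_BOS + body) if include_bos else body

# With BOS: suitable when sending raw text straight to a model
print(mistral_prompt('Hello'))                      # <s>[INST] Hello [/INST]
# Without BOS: leave BOS insertion to the downstream tokenizer
print(mistral_prompt('Hello', include_bos=False))   # [INST] Hello [/INST]
```

The BOS-free variant avoids the double-BOS problem: if the template already contains `<s>` and the tokenizer also adds one, the model sees two BOS tokens, which can degrade training.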