ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Spaces are not being added after added tokens when `legacy: true` is used #7094

Closed · xzuyn closed this issue 16 minutes ago

xzuyn commented 2 months ago

I think LLaMa-1, LLaMa-2, Mistral-v0.1, Mistral-v0.2, Solar (which is based on Mistral-v0.1), and probably a few others all set `"legacy": true` in their tokenizer_config.json.

Trainers like Axolotl use Transformers to tokenize datasets for training, and when this setting is true, Transformers adds a space after special/added tokens. A bit weird in my opinion, which is probably why it is considered legacy behaviour. Weirdness aside, this is fine as long as inference tokenization matches training tokenization. That holds for anything that uses Transformers, but doesn't seem to be the case with llama.cpp.
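As a quick way to see what the flag changes, one could load the same tokenizer with `legacy` flipped on and off. This is only a rough sketch: it assumes the repo ships the SentencePiece model needed for the slow tokenizer, which is the implementation that honours the `legacy` kwarg.

from transformers import AutoTokenizer

# Sketch: compare legacy=True vs legacy=False on the slow (SentencePiece) tokenizer.
repo = "cognitivecomputations/dolphin-2.8-mistral-7b-v02"
tok_legacy = AutoTokenizer.from_pretrained(repo, use_fast=False, legacy=True)
tok_fixed = AutoTokenizer.from_pretrained(repo, use_fast=False, legacy=False)

text = "<|im_end|>\n<|im_start|>assistant\n"
print(tok_legacy.tokenize(text))  # expected to show an extra '▁' after the special tokens
print(tok_fixed.tokenize(text))   # expected to drop that extra space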

LLaMa-3 makes no mention of legacy in its tokenizer_config.json, so it likely no longer follows this behaviour; llama.cpp needs no changes there and works as is.

I used the latest KoboldCPP here because I couldn't figure out how to tokenize with llama.cpp directly (I only ever use KoboldCPP) and wanted to file this issue sooner rather than later. I assume they tokenize the same.

from transformers import AutoTokenizer
import requests

string_to_test = "<|im_start|>user\nTest Input<|im_end|>\n<|im_start|>assistant\nTest Response<|im_end|>"

# https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02
# Transformers version: 4.40.1 (Latest)
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.8-mistral-7b-v02")

# https://huggingface.co/bartowski/dolphin-2.8-mistral-7b-v02-GGUF/blob/main/dolphin-2.8-mistral-7b-v02-Q4_K_S.gguf
# KoboldCPP version: 1.64.1 (Latest)
koboldcpp_token_ids = requests.post(
    "http://127.0.0.1:5001/api/extra/tokencount",
    json={"prompt": string_to_test},
).json()["ids"]

print(tokenizer.encode(string_to_test))
# [1, 32001, 2188, 13, 1963, 11232, 32000, 28705, 13, 32001, 13892, 13, 1963, 12107, 32000]
# ['<s>', '<|im_start|>', '▁user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '▁', '<0x0A>', '<|im_start|>', '▁assistant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']

print(koboldcpp_token_ids)
# [1, 32001, 1838, 13, 1963, 11232, 32000, 13, 32001, 489, 11143, 13, 1963, 12107, 32000]
# ['<s>', '<|im_start|>', 'user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '<0x0A>', '<|im_start|>', 'ass', 'istant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']
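For reference, the same check could be run against llama.cpp itself rather than KoboldCPP. A rough sketch using the example server's /tokenize endpoint, assuming the server was started with the same GGUF on the default port 8080:

import requests

string_to_test = "<|im_start|>user\nTest Input<|im_end|>\n<|im_start|>assistant\nTest Response<|im_end|>"

# llama.cpp's example server exposes POST /tokenize and returns {"tokens": [...]}.
# Newer builds also accept "add_special": true to prepend BOS; older ones may ignore it.
resp = requests.post(
    "http://127.0.0.1:8080/tokenize",
    json={"content": string_to_test, "add_special": True},
)
print(resp.json()["tokens"])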

@ehartford pinging you here since I used your model to test and figured you would want to know about this behaviour.

steampunque commented 2 months ago

I think the Mistral AI models behave differently from ChatML-based models. This note is from the Mistral AI 8x7B model card on HF:

" As reference, here is the pseudo-code used to tokenize instructions during fine-tuning:

def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

[BOS_ID] + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") + tokenize(BOT_MESSAGE_1) + [EOS_ID] + … tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") + tokenize(BOT_MESSAGE_N) + [EOS_ID]

In the pseudo-code above, note that the tokenize method should not add a BOS or EOS token automatically, but should add a prefix space.

In the Transformers library, one can use chat templates which make sure the right format is applied. "
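Spelled out with the Transformers tokenizer, that pseudo-code would look roughly like the sketch below for a single turn. The model name and message strings are just placeholders for illustration:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def tokenize(text):
    # Per the note above: no BOS/EOS added here, just the raw encoding.
    return tok.encode(text, add_special_tokens=False)

user_message = "Test Input"
bot_message = "Test Response"

# One full turn, with BOS/EOS inserted by ID rather than by the tokenizer.
ids = (
    [tok.bos_token_id]
    + tokenize("[INST]")
    + tokenize(user_message)
    + tokenize("[/INST]")
    + tokenize(bot_message)
    + [tok.eos_token_id]
)
print(ids)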

The problem I found was that the Jinja chat templates of even the official Mistral AI models on HF do not seem to correspond to the note above. However, when I implemented the tokenization exactly as described, through a patched llama.cpp where I take complete control of both the chat template definitions and whether the space prefix is added in the llama_tokenize() call, Mistral 8x7B behaved noticeably differently on a couple of short test prompts (it got rid of a leading ':' in one response that should not have been there).

The dolphin 2.8 Transformers tokenization can be matched by manually adding the spaces in the template definitions:

# ChatML with special tokens and inserted spaces to match transformers tokenizer spaces
CHATML_BOH="<|im_start|>"
CHATML_EOS="<|im_end|>"
CHATML_SYSTEM="${CHATML_BOH} system\n"
CHATML_USER="${CHATML_BOH} user\n"
CHATML_ASSISTANT="${CHATML_BOH} assistant\n"
CHATML_SUFFIX="$CHATML_EOS \n"

[1715004344] input prefix: '<|im_start|> user '
[1715004344] template tokens: [ '<s>':1, '<|im_start|>':32001, ' user':2188, '':13 ]
[1715004344] prompt: 'Test Input'
[1715004344] prompt tokens: [ 'Test':1963, ' Input':11232 ]
[1715004344] input prefix: '<|im_end|> '
[1715004344] template tokens: [ '<|im_end|>':32000, ' ':28705, '':13 ]
[1715004344] input prefix: '<|im_start|> assistant '
[1715004344] template tokens: [ '<|im_start|>':32001, ' assistant':13892, '':13 ]
[1715004344] prompt: 'Test Response'
[1715004344] prompt tokens: [ 'Test':1963, ' Response':12107 ]
[1715004344] input prefix: '<|im_end|> '
[1715004344] template tokens: [ '<|im_end|>':32000, ' ':28705, '':13 ]

Most likely the LLaMa-3 Instruct models from Meta do not have these spaces (guessing), but I am not sure whether the new dolphin 2.9 and Hermes 2 Pro ChatML-based LLaMa-3 fine-tunes coming out still have them or not.

I think the bottom line is that exactly matching instruct-tune templates is always going to be hit or miss unless model creators document text-in -> tokens-out test cases for one full turn, exactly as you have done in this issue. I am still not sure whether Mistral 7B 0.1, 0.2, 8x22B, etc. change this space behaviour from fine-tune to fine-tune; it seems hard to reverse engineer from the model itself except by testing with and without spaces in various places and empirically determining which works best.
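One crude way to do that empirical check is to encode the candidate formattings with the reference Transformers tokenizer and compare the resulting IDs against what the llama.cpp side produces. A minimal sketch, reusing the model from the issue report; the two template strings are just examples:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.8-mistral-7b-v02")

without_space = "<|im_end|>\n<|im_start|>assistant\n"
with_space = "<|im_end|> \n<|im_start|> assistant\n"

# Print both ID sequences and compare each against the llama.cpp tokenization
# to see which manual spacing reproduces the training-time sequence.
print(tok.encode(without_space, add_special_tokens=False))
print(tok.encode(with_space, add_special_tokens=False))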

xzuyn commented 1 month ago

This issue still exists.

github-actions[bot] commented 16 minutes ago

This issue was closed because it has been inactive for 14 days since being marked as stale.