ggerganov / llama.cpp

LLM inference in C/C++

Bug: b3028 breaks mixtral 8x22b #7969

Closed · steampunque closed this issue 1 week ago

steampunque commented 2 months ago

What happened?

Mixtral 8x22b model running with server.

b3027 (good):

lm hi
 Hello! How can I help you today? Is there something specific you would like to talk about or ask about? I'm here to provide information and answer questions to the best of my ability.

b3028 (garbage):

lm hi
👋

[INST] I'm here to help you with your questions about the [/INST] 🤓

[INST] I can provide information on a variety of topics, such as [/INST] 📚

[INST] - [/INST] 🏫

Name and Version

b3027 for the good run, b3028 for the broken run.

What operating system are you seeing the problem on?

Linux

Relevant log output

na
nazarov-yuriy commented 2 months ago

Could you point to the exact model file you used?

steampunque commented 2 months ago

> Could you point to the exact model file you used?

I tested on 3 different files:

1) https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF/blob/main/Mixtral-8x22B-Instruct-v0.1.Q4_K_M-00001-of-00002.gguf

2), 3) My own quants (IQ4_XS and Q4_K_S), made with the latest version from the model downloaded straight from mistralai: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1

The b3028 tokenizer patch breaks all of them.

I plan to run IQ4_XS once the bug is corrected, as it appears to work better than Q4_K_S on other models and is also 2 GB smaller for this big model.

steampunque commented 2 months ago

I added b3028 revert patches to my github here: https://github.com/steampunque/llama.cpp-patches, covering the time span from the original patch to the latest update as of today (b3173).

I don't recommend applying the reverts unless you need to run this model, as other desired changes in newer versions may be erased.

Without patch, b3173:

lm hi

[INST]hi [/INST]

[INST]hi [/INST]

[INST]hi [/INST]

[INST]hi [/INST]

[INST]hi [/INST]

[INST]hi [/INST]

After applying revert_b3028_from_3173.patch:

lm hi
 Hello! How can I help you today? Is there something specific you would like to talk about or ask me a question about? I'm here to provide information and answer any questions you have to the best of my ability.
ggerganov commented 2 months ago

Post the llama-server and curl commands that you used, otherwise we can't help

steampunque commented 2 months ago

> Post the llama-server and curl commands that you used, otherwise we can't help

b3181 with no revert. Use any of the model files I summarized earlier.

llama-cli -m /datahd/models/Mixtral-8x22B-Instruct-v0.1.IQ4_XS.gguf      --color -n -1 --multiline-input --interactive-first --log-disable    -ngl 7 -c 8192 -ctk f16 -ctv f16 -b 128 -fa -n 8192 --keep 0    --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0    --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00    --repeat-last-n 64 --repeat-penalty 1.0    --presence-penalty 0.0 --frequency-penalty 0.0    --tfs 1.0    --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0    -p "" --in-prefix "[INST] " --in-suffix " [/INST]"

Aborted conversation:

[INST] Hi 
 [/INST]

   [INST]  I'm here to help you with your questions about the 2022 Toyota Corolla.
   [/INST]

   [INST]  What can I assist you with today?
   [/INST]

   [USER]  What are the dimensions of the 2022 Toyota Corolla?
   [/USER]

   [INST][INST] 

b3181 with b3028 reverted:

[INST] Hi 
 [/INST] Hello! How can I help you today? Is there something specific you would like to know or discuss? I'm here to answer any questions you have to the best of my ability.
[INST] 
ggerganov commented 2 months ago

I don't have 8x22B handy, but with 8x7B I am not able to reproduce using that command:

make -j && ./llama-cli -m ./models/mixtral-instruct-8x7b-fast/ggml-model-q4_k.gguf --color -n -1 --multiline-input --interactive-first -ngl 7 -c 8192 -ctk f16 -ctv f16 -b 128 -fa -n 8192 --keep 0    --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0    --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00    --repeat-last-n 64 --repeat-penalty 1.0    --presence-penalty 0.0 --frequency-penalty 0.0    --tfs 1.0    --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0    -p "" --in-prefix "[INST] " --in-suffix " [/INST]" --verbose-prompt
[INST] Hi 
 [/INST] Hello! How can I help you today? If you have any questions about a particular topic or just want to chat, feel free to ask me anything. I'm here to provide information and engage in a friendly conversation.
[INST] 

You can try removing the whitespace from the instruction prefix/suffix: --in-prefix "[INST]" --in-suffix "[/INST]" - the spaced versions are incorrect

ggerganov commented 2 months ago

I downloaded the model and indeed there is a regression for the Mixtral 8x22B models. The [INST] and [/INST] tokens are no longer tokenized correctly after (#7500)

Before:

./tokenize -m Mixtral-8x22B-Instruct-v0.1.IQ4_XS-00001-of-00002.gguf -p "[INST]"

     3 -> '[INST]'

Now:

  1501 -> ' ['
 17057 -> 'INST'
 29561 -> ']'

cc @jaime-m-p

steampunque commented 2 months ago

> I downloaded the model and indeed there is a regression for the Mixtral 8x22B models. The [INST] and [/INST] tokens are no longer tokenized correctly after (#7500)
>
> Before:
>
> ./tokenize -m Mixtral-8x22B-Instruct-v0.1.IQ4_XS-00001-of-00002.gguf -p "[INST]"
>
>      3 -> '[INST]'
>
> Now:
>
>   1501 -> ' ['
>  17057 -> 'INST'
>  29561 -> ']'
>
> cc @jaime-m-p

Below appears to be the list of special tokens the model wants, from tokenizer.json. I think Mistral Instruct v0.3 also moved to special tokens for [INST] [/INST] (at the same time the function-calling stuff was added and the vocabulary was expanded), and it works fine without reverting the patch. Looking at differences between the models, it appears Mistral Instruct v0.3 also defines the special tokens in tokenizer_config.json while Mixtral 8x22b does not, so that may explain the problem, but the mystery remains as to why special token conversion ever worked with older builds.

{ "version": "1.0", "truncation": null, "padding": null, "added_tokens": [ { "id": 0, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 1, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 2, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 3, "content": "[INST]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 4, "content": "[/INST]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 5, "content": "[TOOL_CALLS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 6, "content": "[AVAILABLE_TOOLS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 7, "content": "[/AVAILABLE_TOOLS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 8, "content": "[TOOL_RESULTS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true }, { "id": 9, "content": "[/TOOL_RESULTS]", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "special": true } ],

jaime-m-p commented 2 months ago

I'm trying to find the root of the problem.

Found some differences while loading special tokens:

from transformers import AutoTokenizer
import gguf

dir_tokenizer = "./models/tokenizers/mixtral8x22b"

tokenizer = AutoTokenizer.from_pretrained(dir_tokenizer)
print(tokenizer.added_tokens_encoder)
# {'<unk>': 0, '<s>': 1, '</s>': 2, '[INST]': 3, '[/INST]': 4, '[TOOL_CALLS]': 5, '[AVAILABLE_TOOLS]': 6, '[/AVAILABLE_TOOLS]': 7, '[TOOL_RESULTS]': 8, '[/TOOL_RESULTS]': 9}

special_vocab = gguf.SpecialVocab(dir_tokenizer)
print(special_vocab.special_token_ids)
# {'bos': 1, 'eos': 2, 'unk': 0}

I have to look closer, but I'm confused by the method gguf.SpecialVocab._try_load_from_tokenizer_json(). It loads both files, tokenizer.json and tokenizer_config.json, then tries to match tokens against the list special_token_types = ('bos', 'eos', 'unk', 'sep', 'pad', 'cls', 'mask'), doing an intersection instead of a union, losing all the new special tokens.

> but the mystery remains as to why special token conversion ever worked with older builds.

Probably this: https://github.com/ggerganov/llama.cpp/pull/7500/commits/938cb4941a594f79a44c36c128645148a69ce6be. Instead of doing a search for un-mergeable tokens (assuming they are special), it now trusts the toktypes from the GGUF file. This fixed some problems with other models, but I see it breaks others.

I think the correct option is to try to fix gguf.SpecialVocab._try_load_from_tokenizer_json(), but this forces regenerating the GGUF files. Not sure if this is acceptable; the second option is to reintroduce the old behaviour and try to guess which models belong to each option.
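
As a rough illustration of that first option, a union-style load could look something like this (simplified sketch, not the actual gguf-py code):

import json
from pathlib import Path

def load_special_tokens(dir_tokenizer: str) -> dict[str, int]:
    # Collect ALL added tokens flagged "special" from tokenizer.json,
    # instead of keeping only those matching the fixed bos/eos/unk/... list.
    data = json.loads((Path(dir_tokenizer) / "tokenizer.json").read_text(encoding="utf-8"))
    return {t["content"]: t["id"] for t in data.get("added_tokens", []) if t.get("special")}

# For Mixtral 8x22B this would return [INST], [/INST], [TOOL_CALLS], ... in
# addition to <unk>/<s>/</s>, so the converter can mark them all as control tokens.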

steampunque commented 2 months ago

> I think the correct option is to try to fix gguf.SpecialVocab._try_load_from_tokenizer_json(), but this forces regenerating the GGUF files. Not sure if this is acceptable; the second option is to reintroduce the old behaviour and try to guess which models belong to each option.

In my opinion regenerating the GGUFs should not be an issue. It has happened multiple times in the past with BPE updates and will no doubt continue into the future as models evolve with new tokenizers, new vocabularies, and other needs. It would be good to output a warning message saying that the model should be updated, if the condition can be detected at all. It might also be possible to create a utility that just updates the metadata, to avoid the pain of re-converting and re-quanting the larger models. This particular issue might invalidate a lot of models that lean on added special tokens, though.

ggerganov commented 2 months ago

If gguf.SpecialVocab._try_load_from_tokenizer_json() is the problem, we should fix it. We can require regenerating such models. It would be nice to print a warning for old GGUFs that have the problem, but I'm not sure we will be able to detect outdated ones.
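
One possible heuristic, as a rough and untested sketch: flag a GGUF where "[INST]" is in the vocab but not typed as a control token (this assumes the gguf-py GGUFReader field layout):

import sys
import gguf

reader = gguf.GGUFReader(sys.argv[1])

def decode_array(field):
    # String arrays keep each element as raw bytes in field.parts, indexed
    # by field.data; numeric arrays are flattened from the indexed parts.
    if field.types[-1] == gguf.GGUFValueType.STRING:
        return [str(bytes(field.parts[i]), encoding="utf-8") for i in field.data]
    return [v for i in field.data for v in field.parts[i].tolist()]

tokens = decode_array(reader.get_field("tokenizer.ggml.tokens"))
types  = decode_array(reader.get_field("tokenizer.ggml.token_type"))

if "[INST]" in tokens and types[tokens.index("[INST]")] != gguf.TokenType.CONTROL:
    print("outdated GGUF: [INST] is in the vocab but not marked as a control token")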

Azirine commented 2 months ago

I reproduced this in main. Input: hi

./main -m Mixtral-8x22B-Instruct-v0.1.i1-Q4_K_M.gguf --no-mmap -fa -ins --in-prefix "[INST] " --in-suffix "[/INST] " -s 0

b3027: no problems

> [INST] hi
[/INST] Hello! How can I assist you today? If you have any questions or need help with something, feel free to ask. I'm here to help!

> [INST]  

b3028-b3086: strange prefix/suffix tokens

> [INST] hi
[/INST]  [RESP]  Hello! How can I help you today? [/RESP] 

### Instruction:

> [INST] 
./main -m Mixtral-8x22B-Instruct-v0.1.i1-Q4_K_M.gguf --no-mmap -fa -if --in-prefix "[INST] " --in-suffix "[/INST] " -s 0

b3087-b3140: random mention of IELTS

[INST] hi
[/INST]  hi there! How can I help you today? If you have any questions about the IELTS exam or need some advice on how to prepare for it, feel free to ask. I'm here to help you out.
[INST] 
steampunque commented 2 months ago

> I reproduced this in main. Input: hi

It's still broken as of b3233. It is necessary to revert the b3027->b3028 update if you want to run the model.

Azirine commented 1 month ago

This also broke Yi-1.5-34b.

./main -m Yi-1.5-34B-Chat-16K-Q4_K_M.gguf --no-mmap -fa -cml -s 0 --special

Input: hi

b3027

<|im_start|> system
<|im_end|>
<|im_start|> user

> hi
Hello! How can I assist you today?<|im_end|>

> 

b3028

 <|im_start|>system
<|im_end|>
> hi
Hello! How can I assist you today? <|im_end|>
<|im_start|>user

> 

In b3028, <|im_end|> <|im_start|>user are generated character by character, showing that they are no longer treated as special tokens.

steampunque commented 1 month ago

Update: MistralAI updated the tokenizer configs for 8x22b a couple of days ago, so I ran a new convert/quant and everything looks OK now:

SPECIAL=1 tokenize "[INST][/INST][TOOL_CALLS][AVAILABLE_TOOLS][/AVAILABLE_TOOLS][TOOL_RESULTS][/TOOL_RESULTS]" [1,2,3,4,5,6,7,8,9]

lm Hello.
 Hello! How can I help you today?

I tested with 3 versions: b3266 with the b3028 revert patch, b3266 with no revert patch, and the latest b3324 with no revert patch, and all were good on the new convert. So it looks like the Mistral folks fixed the problem on their side; however, it will require re-converting from their update of a couple of days ago.

I will leave this issue open for now since the bug breaks other models too, but Mixtral 8x22b is now good thanks to the MistralAI update.

shibe2 commented 1 month ago

Will there be proper support for special tokens, such that the model can see the difference between, for example, the special token [INST] and the string "[INST]" occurring somewhere in the text?

ggerganov commented 1 month ago

There is already support - see the parse_special flag of llama_tokenize

shibe2 commented 1 month ago

Right. What I meant is support in the examples.

ggerganov commented 1 month ago

There will be, but can't say when - does not seem very high prio IMO

github-actions[bot] commented 1 week ago

This issue was closed because it has been inactive for 14 days since being marked as stale.