ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Missing tokenizer tests #3730

Closed goerch closed 10 months ago

goerch commented 10 months ago

AFAIU we are missing tokenizer tests for several of the supported models.

It would be great if anyone could help out.
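
For context, the tokenizer tests in this repo typically compare llama.cpp's tokenizer output against the reference Hugging Face tokenizer for a fixed set of test strings. A minimal sketch of how such reference data could be generated (the model id and the test strings below are only placeholders, not taken from this issue):

# Sketch: dump reference tokenizations with the Hugging Face tokenizer so they
# can be compared against llama.cpp's tokenizer output for the same strings.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

test_strings = [
    "",
    " ",
    "Hello world",
    " Hello world",
    "Hello World!",
    "\n =",
]

for s in test_strings:
    ids = tokenizer.encode(s)
    print(repr(s), "->", ids)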

Galunid commented 10 months ago

If it's the same deal as what I did in the stablelm PR, I can do that tomorrow.

Galunid commented 10 months ago

Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

for i in range(vocab_size):
-    tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
-    scores.append(0.0) # dummy
-    toktypes.append(gguf.TokenType.NORMAL)
+    if i not in reverse_vocab:
+        tokens.append(f"[PAD{i}]")
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    elif reverse_vocab[i] in added_vocab:
+        # NOTE: wouldn't we like to distinguish CONTROL tokens here?
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.USER_DEFINED)
+    else:
+        tokens.append(reverse_vocab[i])
+        toktypes.append(gguf.TokenType.NORMAL)
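
For illustration, here is a self-contained toy run of the new loop with made-up data (the string labels stand in for the gguf.TokenType constants):

# Illustrative sketch with a fabricated vocab: id 2 is missing, so it becomes a
# [PAD2] placeholder; the added token becomes USER_DEFINED; the rest are NORMAL.
reverse_vocab = {0: "<|endoftext|>", 1: "hello", 3: "<|custom|>"}  # note the gap at id 2
added_vocab = {"<|custom|>": 3}
vocab_size = 4

tokens, toktypes = [], []
for i in range(vocab_size):
    if i not in reverse_vocab:
        tokens.append(f"[PAD{i}]")
        toktypes.append("USER_DEFINED")
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        toktypes.append("USER_DEFINED")
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append("NORMAL")

print(list(zip(tokens, toktypes)))
# [('<|endoftext|>', 'NORMAL'), ('hello', 'NORMAL'),
#  ('[PAD2]', 'USER_DEFINED'), ('<|custom|>', 'USER_DEFINED')]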
goerch commented 10 months ago

Should we apply this fix to conversion scripts of other gpt2-tokenizer based models before generating the vocab files?

Good idea, but no: let's see what issues come up and fix them accordingly.

goerch commented 10 months ago

@Galunid: I retested the conversion of mpt and tested the conversion of gpt-neox with the following code,

# collect the tokens added on top of the base vocab and build the
# reverse map (token id -> token string)
added_vocab = tokenizer.get_added_vocab()
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}

for i in range(vocab_size):
    if i not in reverse_vocab:
        # gap in the vocab: emit a placeholder token
        tokens.append(f"[PAD{i}]")
        toktypes.append(gguf.TokenType.USER_DEFINED)
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        # added tokens flagged as special become CONTROL, the rest USER_DEFINED
        if tokenizer.added_tokens_decoder[i].special:
            toktypes.append(gguf.TokenType.CONTROL)
        else:
            toktypes.append(gguf.TokenType.USER_DEFINED)
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_types(toktypes)

incorporating @jploski's explanation of how to detect special tokens. The tests still pass for mpt and now also for gpt-neox. So it is fine with me if you go ahead and fix the conversion scripts of the other gpt2-tokenizer based models this way. Thanks for your help!
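
For reference, the special flag used above can be inspected directly on a transformers tokenizer via added_tokens_decoder; a small sketch (the gpt-neox checkpoint id is just an example):

# Sketch: list the added tokens and whether transformers marks them as special,
# which is what decides CONTROL vs USER_DEFINED in the conversion loop above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

for tok_id, added_tok in tokenizer.added_tokens_decoder.items():
    kind = "CONTROL" if added_tok.special else "USER_DEFINED"
    print(tok_id, repr(added_tok.content), "->", kind)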

Galunid commented 10 months ago

Sure. To clarify: you mean rework all the gpt2 conversion scripts and update the vocabs in the test PR?

goerch commented 10 months ago

Sure. To clarify: you mean rework all the gpt2 conversion scripts and update the vocabs in the test PR?

As you like and as your time permits. I hope someone will object if this is the wrong way to go.