ggerganov / llama.cpp

LLM inference in C/C++

Bug: Gemma2 tokenization seems incorrect. #8349

Closed by AUTOMATIC1111 2 months ago

AUTOMATIC1111 commented 3 months ago

What happened?

tokenizer.json from Gemma2 contains this token: "[toxicity=0]": 255968.

When tokenizing that text with llama.cpp, we get [235309, 1373, 235293, 235276, 235307].
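For anyone who wants to reproduce this programmatically, here is a minimal sketch against llama.cpp's C API as it looked around this version (llama.h, b3317); the model path argument and the parse_special toggle are assumptions on my part, not something from the report. It tokenizes the string twice, once treating special tokens as plain text and once allowing them to be parsed:

```cpp
// Minimal sketch, assuming llama.cpp's C API around b3317 (llama.h).
// Tokenizes "[toxicity=0]" twice: once with special-token parsing off,
// once with it on, and prints the resulting token ids.
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <gemma2-model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == nullptr) {
        return 1;
    }

    const char * text = "[toxicity=0]";
    std::vector<llama_token> toks(64);

    for (bool parse_special : {false, true}) {
        const int n = llama_tokenize(model, text, (int) strlen(text),
                                     toks.data(), (int) toks.size(),
                                     /*add_special=*/false, parse_special);
        printf("parse_special=%s:", parse_special ? "true" : "false");
        for (int i = 0; i < n; i++) {
            printf(" %d", toks[i]);
        }
        printf("\n");
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

With parse_special=false the string should come out as the five plain-text ids above; whether the single id 255968 appears with parse_special=true depends on the token being marked as a special/control token in the GGUF metadata.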

If I ask llama.cpp Gemma2 to repeat this text, [toxicity=0], it does so effortlessly.

If I ask a corporate-hosted Gemma2 to repeat it, it fails, as if there were no text there:

[screenshot: hosted Gemma2 responds as if the input contains no text]

Name and Version

version: 3317 (8e558309) built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

cuelebra commented 3 months ago

Confirming. LMSYS:

[screenshot: LMSYS-hosted Gemma2 failing to repeat the text]

llama.cpp Gemma2 27B Q8_0:

```
<start_of_turn>user
Repeat the following text:
START->[toxicity=0]<-END<end_of_turn>
<start_of_turn>model

START->[toxicity=0]<-END
```

ngxson commented 3 months ago

Please refer to: https://github.com/ggerganov/llama.cpp/issues/8240#issuecomment-2212444937