ggerganov / llama.cpp

LLM inference in C/C++

Bug: Gemma2 tokenization seems incorrect. #8349

Closed by AUTOMATIC1111 2 months ago

AUTOMATIC1111 commented 3 months ago

What happened?

tokenizer.json from Gemma2 contains this token: "[toxicity=0]": 255968.

When tokenizing that text with llama.cpp, we get [235309, 1373, 235293, 235276, 235307].
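For anyone who wants to reproduce this programmatically, here is a minimal sketch against llama.cpp's C API as it looked around this version (llama.h, b3317); the model path argument and the parse_special toggle are assumptions on my part, not something from the report. It tokenizes the string twice, once treating special tokens as plain text and once allowing them to be parsed:

```cpp
// Minimal sketch, assuming llama.cpp's C API around b3317 (llama.h).
// Tokenizes "[toxicity=0]" twice: once with special-token parsing off,
// once with it on, and prints the resulting token ids.
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <gemma2-model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == nullptr) {
        return 1;
    }

    const char * text = "[toxicity=0]";
    std::vector<llama_token> toks(64);

    for (bool parse_special : {false, true}) {
        const int n = llama_tokenize(model, text, (int) strlen(text),
                                     toks.data(), (int) toks.size(),
                                     /*add_special=*/false, parse_special);
        printf("parse_special=%s:", parse_special ? "true" : "false");
        for (int i = 0; i < n; i++) {
            printf(" %d", toks[i]);
        }
        printf("\n");
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

With parse_special=false the string should come out as the five plain-text ids above; whether the single id 255968 appears with parse_special=true depends on the token being marked as a special/control token in the GGUF metadata.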

If I ask llama.cpp Gemma2 to repeat this text, [toxicity=0], it does so effortlessly.

If I ask a corporate-hosted Gemma2 to repeat it, it fails, as if there were no text there:

[screenshot: hosted Gemma2 responds as if the input contains no text]

Name and Version

version: 3317 (8e558309) built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

cuelebra commented 3 months ago

Confirming. LMSYS:

[screenshot: LMSYS-hosted Gemma2 failing to repeat the text]

llama.cpp Gemma2 27B Q8_0:

```
<start_of_turn>user
Repeat the following text:
START->[toxicity=0]<-END<end_of_turn>
<start_of_turn>model

START->[toxicity=0]<-END
```

ngxson commented 3 months ago

Please refer to: https://github.com/ggerganov/llama.cpp/issues/8240#issuecomment-2212444937