ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Tokenizer not working on partial UTF-8 bytes #8691

Open riedgar-ms opened 1 month ago

riedgar-ms commented 1 month ago

What happened?

Note: Discovered by one of the users of Guidance.

If a multibyte UTF-8 character is encoded as two tokens, llama.cpp is unable to tokenise the partial byte sequence corresponding to just one of those tokens.

To see this:

printf '\xe6\xad\xaa'
歪

printf '\xe6\xad\xaa' | ./llama-tokenize -m ~/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-IQ3_S.gguf --stdin --no-bos
<snip>
 15722 -> '�'
   103 -> '�'

So the single character '歪' has been encoded as two tokens, 15722 and 103. A little further investigation (via the Python interfaces) revealed that 15722 maps to the bytes \xe6\xad. However, if we try tokenising just those bytes with llama-tokenize:

 printf '\xe6\xad' | ./llama-tokenize -m ~/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-IQ3_S.gguf --stdin --no-bos
<snip>
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted (core dumped)

which is not particularly helpful (and breaks Guidance). Rather than an uncaught C++ exception, we were expecting the output 15722 -> '�' for this reduced tokenisation request.
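For anyone who wants to reproduce this without going through the CLI, here is a minimal sketch of the same request made directly against the C API. The exact llama.h signatures are assumptions based on the build referenced below (3460) and may differ in newer versions; the point is only that llama_tokenize on the two-byte prefix \xe6\xad currently aborts instead of returning a token list.

// Minimal repro sketch against the C API (assumption: llama.h as of roughly
// build 3460; function names/signatures may differ in newer builds).
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true; // only the tokenizer is needed for this test

    llama_model * model = llama_load_model_from_file(argv[1], mparams);

    // First two bytes of U+6B6A '歪' (UTF-8: e6 ad aa) -- an intentionally
    // partial sequence, i.e. the bytes that token 15722 maps back to.
    const std::string text = "\xe6\xad";

    std::vector<llama_token> tokens(8);

    // Expected: a result such as the single token 15722.
    // Observed: std::invalid_argument ("invalid character") escapes and the
    // process aborts, as in the llama-tokenize run above.
    const int n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                                 tokens.data(), (int32_t) tokens.size(),
                                 /*add_special=*/false, /*parse_special=*/false);

    printf("n_tokens = %d\n", n);
    for (int i = 0; i < n; ++i) {
        printf("%d\n", tokens[i]);
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}

The same two-byte input can also be produced with printf '\xe6\xad' as above; the sketch simply takes the CLI out of the equation.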

This has been tested on Linux and Windows.

For reference, the original issue filed on Guidance: https://github.com/guidance-ai/guidance/issues/934

Name and Version

./llama-cli --version
version: 3460 (ed67bcb2)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

No response

riedgar-ms commented 4 weeks ago

Any update on this?