What happened?
Note: Discovered by one of the users of Guidance.

If a multibyte UTF-8 character is encoded to two tokens, LlamaCpp is unable to tokenise the byte representation of one of the tokens.

To see this: the single character '歪' encodes to two tokens, 15722 and 103. A little further investigation (via the Python interfaces) revealed that 15722 maps to the bytes \xe6\xad. However, if we try running this through llama-tokenize:

printf '\xe6\xad' | ./llama-tokenize -m ~/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-IQ3_S.gguf --stdin --no-bos
<snip>
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted (core dumped)

the process aborts, which is not particularly helpful (and breaks Guidance). Rather than the C++ exception, we were expecting the output 15722 -> '�' for the reduced tokenisation request.

This has been tested on Linux and Windows.
For reference, the original issue filed on Guidance: https://github.com/guidance-ai/guidance/issues/934
Name and Version
./llama-cli --version
version: 3460 (ed67bcb2)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux, Windows
Relevant log output
No response