ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Command-R (CohereForAI model) tokenization disagrees with HF implementation #6104

Closed: Noeda closed this issue 5 months ago

Noeda commented 6 months ago

Command-R support was recently merged here: https://github.com/ggerganov/llama.cpp/pull/6033

This issue is also discussed here where I initially thought it might be a bug on HF implementation side: https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27

The model uses BPE; however, something about the tokenization is not exactly the same. I don't think it has any major impact on output quality, but it does lead to the two implementations disagreeing slightly on the top logits in some of my tests.

To test Command-R tokens we can use this with the HF model:

#!/usr/bin/env python3

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

test_string = """
This is a sentence.

### Sentence
"""

print(tokenizer.encode(test_string))

# -> [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp comparison (I hacked the tokenize example to read the string from a file whose path is given in argv[2], instead of tokenizing argv[2] directly... do any of the CLI tools print the tokens without having to do that?)

$ tokenize ~/text-generation-webui/models/commandr_dev_f16.gguf tokens_test2
<omitted output until token list>

     5 -> ''
   206 -> '
'
  4184 -> 'This'
  1801 -> ' is'
  1671 -> ' a'
 27281 -> ' sentence'
    21 -> '.'
  2126 -> '

'
  2680 -> '###'
195143 -> ' Sentence'
   206 -> '
'

To put the token lists side by side for readability:

HF:
# [5, 206, 4184, 1801, 1671, 27281, 21, 206, 206, 2680, 195143, 206]

llama.cpp:
# [5, 206, 4184, 1801, 1671, 27281, 21, 2126,     2680, 195143, 206]

The part that's different is two 206s vs one 2126. (206 = '\n', 2126 = '\n\n').
As far as I can tell, both implementations always decode back to the original string exactly.

The tokenizers don't seem to behave exactly the same: llama.cpp is more eager than the HF version to emit 2126 for \n\n.
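
For anyone who wants to reproduce the comparison without patching the tokenize example, here is a rough sketch using the llama-cpp-python bindings (assuming they are installed; the GGUF path is a placeholder for a local conversion, and vocab_only should avoid loading the weights):

#!/usr/bin/env python3

from transformers import AutoTokenizer
from llama_cpp import Llama

model_id = "CohereForAI/c4ai-command-r-v01"
text = "\nThis is a sentence.\n\n### Sentence\n"

hf_tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = Llama(model_path="commandr_dev_f16.gguf", vocab_only=True)  # placeholder path

hf_ids = hf_tok.encode(text)                                 # BOS added by the HF tokenizer
cpp_ids = llm.tokenize(text.encode("utf-8"), add_bos=True)   # BOS added explicitly

print("HF:       ", hf_ids)
print("llama.cpp:", cpp_ids)
print("match:", hf_ids == cpp_ids)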

I verified with Cohere that their implementation is correct (https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/27); I initially thought llama.cpp was correct and theirs was buggy.

The model might be slightly smarter if we matched the tokenization, since it would then match how the model was trained. Testing around empirically, I really don't think this impacts output quality in any material way, but it can influence the ordering of the top tokens a bit, which can be noticeable. In a llama.cpp vs. HF comparison on a test prompt of about 2200 tokens, 7 tokens diverged (all of them two 206s vs. one 2126) and some logits reordered themselves. Maybe with particular kinds of prompts the divergence in tokenization would be much greater and the output much more different.

I'm offering to investigate and put together a PR, with an ETA of some time next week when I can invest more time. As of opening this issue, I haven't read the tokenization code on either the HF or the llama.cpp side yet.

ggerganov commented 6 months ago

The most likely reason is that llama.cpp currently implements one specific BPE pre-processing regex. I'm not even sure which one exactly anymore; likely the one used by GPT-2 per this comment, though one should verify:

https://github.com/ggerganov/llama.cpp/blob/b5f4ae09c3244ae1644b67c03ed9f4227ab25ad2/llama.cpp#L9989-L9999

I would guess that Command-R uses some other regex. AFAIK each BPE-based model can use an arbitrary regex:

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

The problem becomes more significant because in C++ we can't simply use the built-in regex functions, since they don't fully support Unicode (or something like that, I'm no expert). So we have to do things like this: https://github.com/ggerganov/llama.cpp/pull/4070. But that becomes very slow, so we have to custom-implement regex functions like the bpe_gpt2_preprocess() above.
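
For illustration only (not necessarily the regex Command-R actually uses): the classic GPT-2 pre-tokenization pattern splits trailing whitespace with a \s+(?!\S) lookahead, which already yields two separate '\n' pre-tokens for the "\n\n" in the test string above, i.e. the same shape the HF tokenizer produces. A hand-written approximation that keeps contiguous whitespace together would instead hand "\n\n" to BPE in one piece, which would explain the 2126 merge seen above. A minimal sketch with the third-party regex module (the built-in re lacks \p{...} support):

#!/usr/bin/env python3

import regex  # pip install regex

# GPT-2 pre-tokenization pattern (classic form)
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "\nThis is a sentence.\n\n### Sentence\n"
print(regex.findall(GPT2_PAT, text))
# -> ['\n', 'This', ' is', ' a', ' sentence', '.', '\n', '\n', '###', ' Sentence', '\n']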

So long story short, the BPE tokenization in llama.cpp has to be improved. I've created an issue and sketched a rough plan of what needs to be done: https://github.com/ggerganov/llama.cpp/issues/5981

Noeda commented 6 months ago

Aha, thanks @ggerganov for the info. I'll keep an eye on the regex work and may help with it, depending on how much of my own time I can invest.

daulet commented 6 months ago

Cohere and a lot of other models use HuggingFace's tokenizer, so a drop-in fix is to use that library for tokenization (just feed it the corresponding tokenizer config, e.g. this one) and avoid a reimplementation and future maintenance burden in this project. The only issue is that the library is written in Rust.
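
A rough sketch of what that looks like through the Python bindings of the same Rust library (the tokenizers package; assuming the model's tokenizer config can be fetched from the Hub):

#!/usr/bin/env python3

from tokenizers import Tokenizer  # Python bindings over the Rust 'tokenizers' crate

tok = Tokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
enc = tok.encode("\nThis is a sentence.\n\n### Sentence\n")
print(enc.ids)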

@ggerganov would you be open to bringing Rust into llama.cpp?

ggerganov commented 6 months ago

3rd-party projects can always use an external library to tokenize and then pass the tokens to llama.cpp. It's not very convenient, but for now I don't want to incorporate a specific 3rd-party implementation in the project. It's unfortunate that we don't properly support all sorts of tokenizers, but hopefully with time we will improve, at least to support the important bits.
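
As a sketch of that pattern (external tokenizer in front, llama.cpp for inference), here is a rough example via the llama-cpp-python bindings, with a placeholder GGUF path and assuming the HF ids line up with the GGUF vocab, as the side-by-side lists above suggest they do for this model:

#!/usr/bin/env python3

from transformers import AutoTokenizer
from llama_cpp import Llama

hf_tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01", trust_remote_code=True)
llm = Llama(model_path="commandr_dev_f16.gguf")  # placeholder path

# tokenize outside llama.cpp, then feed the ids straight into generation
prompt_ids = hf_tok.encode("\nThis is a sentence.\n\n### Sentence\n")

out_ids = []
for tok in llm.generate(prompt_ids, temp=0.0):
    if tok == llm.token_eos() or len(out_ids) >= 32:
        break
    out_ids.append(tok)

print(llm.detokenize(out_ids).decode("utf-8", errors="replace"))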

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.