Hey @chrisliu298 -- we abstracted away from the HuggingFace pipeline to our own one to simplify some of this.
Does the HF Tokenizer add a bos_token by default? See here: https://github.com/allenai/reward-bench/blob/main/rewardbench/models/pipeline.py
Ah okay, I just ran an example and this seems right. Hmm, looking into it.
Minimal example:
from transformers import AutoTokenizer

# Check whether this tokenizer adds a BOS token by default.
tokenizer = AutoTokenizer.from_pretrained("oobabooga/llama-tokenizer")
out = tokenizer("Testing my text")
print(out)
print(tokenizer.bos_token)
print(tokenizer.convert_ids_to_tokens(out['input_ids']))
@chrisliu298 -- it varies by model. Some examples follow (that should likely be fixed).
I think the solution is to check, in the standard inference pipeline, whether the bos_token gets doubled (a programmatic check is sketched after the example below).
Example with the failure:
>>> chat = [
... {"role": "user", "content": "Hello, how are you?"},
... {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
... {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]
>>> chat = tokenizer.apply_chat_template(chat, tokenize=False)
>>> chat
"<s><|im_start|>user\nHello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing great. How can I help you today?<|im_end|>\n<|im_start|>user\nI'd like to show off how chat templating works!<|im_end|>\n<|reward|>"
>>> tokenizer(chat)
{'input_ids': [1, 1, 92543, 1008, 364, 9843, 328, 1392, 657, 629, 345, 92542, 364, 92543, 525, 11353, 364, 295, 2940, 3890, 2395, 281, 2745, 777, 489, 1638, 629, 3514, 345, 92542, 364, 92543, 1008, 364, 295, 4330, 1217, 442, 1620, 1147, 1392, 6392, 1708, 631, 1237, 4437, 346, 92542, 364, 92527], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
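A quick way to check this programmatically (a minimal sketch; the RM-Mistral-7B tokenizer is the example linked below in this thread, and the variable names are mine):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("weqweasdas/RM-Mistral-7B")
chat = [{"role": "user", "content": "Hello, how are you?"}]
text = tokenizer.apply_chat_template(chat, tokenize=False)
ids = tokenizer(text)['input_ids']

# If the chat template already emits <s> and the tokenizer also prepends a BOS,
# the first two ids will both be bos_token_id.
doubled = (
    tokenizer.bos_token_id is not None
    and len(ids) >= 2
    and ids[0] == ids[1] == tokenizer.bos_token_id
)
print("BOS doubled:", doubled)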
I'll submit a small fix for off-by-one errors in the default pipeline.
The top model with this issue that I've found is https://huggingface.co/weqweasdas/RM-Mistral-7B/blob/main/tokenizer_config.json; most models use custom code or do not have a BOS token. The specific implementation of InternLM could be wrong, but I consider that a separate issue.
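For reference, one possible shape of such a fix (a minimal sketch, not the actual patch; the helper name is hypothetical): after formatting with the chat template, drop a duplicated leading BOS id before the forward pass.
def drop_duplicate_bos(tokenizer, text, **tokenizer_kwargs):
    # Hypothetical helper: tokenize chat-templated text and strip a doubled BOS.
    enc = tokenizer(text, **tokenizer_kwargs)
    ids = enc['input_ids']
    bos = tokenizer.bos_token_id
    if bos is not None and len(ids) >= 2 and ids[0] == ids[1] == bos:
        enc['input_ids'] = ids[1:]
        enc['attention_mask'] = enc['attention_mask'][1:]
    return enc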
I noticed that, for default (sequence classification) models with a chat template defined in the tokenizer, scripts/run_rm.py formats each conversation with tokenizer.apply_chat_template (via the function prepare_dialogue_from_tokenizer) and then uses the text classification pipeline to process the formatted conversations. Given that 1) many models' tokenizers (e.g., the Llama-3 instruct series, the Gemma-2 instruct series, etc.) include the bos_token in the chat template, and 2) the pipeline adds another bos_token during tokenization, does this mean these models see two BOS tokens in the forward pass?

I also realized that some models (e.g., ArmoRM) inherently avoid this potential issue via a customized pipeline that performs tokenization directly with tokenizer.apply_chat_template (as opposed to first formatting, then tokenizing); a sketch of that approach is below.
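A minimal sketch of that "tokenize directly" approach (the model name is a placeholder, and this is not ArmoRM's actual pipeline code): apply_chat_template with tokenize=True inserts the special tokens exactly once, so no second bos_token is added by a later tokenizer call.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "path/to/sequence-classification-reward-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
# Tokenize directly from the chat template; special tokens are added exactly once.
input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits
print(reward)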