allenai / open-instruct


Wrong tokenization of the EOS token #113

Closed Hannibal046 closed 4 months ago

Hannibal046 commented 5 months ago

Hi, team!

Thanks so much for the great repo! I want to report a problem with how the EOS token is added in finetune.py:
https://github.com/allenai/open-instruct/blob/763206f5b8fe340c7833c438113f69ba7bca8886/open_instruct/finetune.py#L277
https://github.com/allenai/open-instruct/blob/763206f5b8fe340c7833c438113f69ba7bca8886/open_instruct/finetune.py#L311
Here is why this can cause a problem:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
text = "a random text"

# EOS token concatenated directly onto the text, as finetune.py does
print(tokenizer.tokenize(text + tokenizer.eos_token))

# EOS token concatenated after a space, for comparison
print(tokenizer.tokenize(text + " " + tokenizer.eos_token))

[screenshot of the tokenized output for both cases]

hamishivi commented 5 months ago

Hi,

thanks for the issue! Curiously, we don't see this behaviour with the tokenizer from the official Meta release:

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer("hello</s>")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 29871, 2], 'attention_mask': [1, 1, 1, 1]}

But I do see it with other uploaded Llama tokenizers:

>>> tokenizer = AutoTokenizer.from_pretrained("oobabooga/llama-tokenizer", legacy=False)
>>> tokenizer("hello</s>")
{'input_ids': [1, 22172, 829, 29879, 29958], 'attention_mask': [1, 1, 1, 1, 1]}
>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}

Looking at the configs, they are not identical. Our code behaves correctly when we use the official tokenizer, and adding a space actually changes the tokenization with the official tokenizer:

>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 29871, 2], 'attention_mask': [1, 1, 1, 1]}

As such, I don't really want to change this, but this is a useful find. Perhaps you can edit the script locally to match your use case, if you need to use this tokenizer?
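For example, a minimal sketch of such a local edit (example_text here is a hypothetical stand-in for the string the script builds, not the repo's actual code):

# Hypothetical local change: add a space before the EOS token so that tokenizers
# like huggyllama/llama-7b keep </s> as a single token instead of splitting it.
example_text = example_text + " " + tokenizer.eos_token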

Hannibal046 commented 5 months ago

Hi, thanks so much for the response and the detailed investigation! I will update the script locally.

Hannibal046 commented 4 months ago

Hi, @hamishivi

I found a compatible way: passing add_eos_token=True when initializing the tokenizer. [screenshot of the tokenized output]
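Roughly, the idea is the following (a minimal sketch, reusing the model from my first example):

from transformers import AutoTokenizer

# With add_eos_token=True the tokenizer appends the EOS id itself, so there is
# no need to concatenate tokenizer.eos_token into the text at all.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True)
print(tokenizer("a random text").input_ids)  # last id is tokenizer.eos_token_id (2)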

Hannibal046 commented 4 months ago

Hi @hamishivi, sorry to bother you again.

But could you please check this issue? https://github.com/huggingface/transformers/issues/29375

Can you replicate this behaviour on your side?

Hannibal046 commented 4 months ago

Unrelated, since Tulu does not use the fast tokenizer.
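(For context, a minimal sketch of what I mean, assuming the script loads the slow tokenizer with use_fast=False; the exact option in finetune.py may be named differently:)

from transformers import AutoTokenizer

# use_fast=False selects the slow SentencePiece-based LlamaTokenizer, so the
# fast-tokenizer behaviour reported in transformers#29375 does not apply here.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)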