Hi,
thanks for the issue! Curiously, we don't see this issue with the tokenizer from the official facebook release:
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer("hello</s>")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 29871, 2], 'attention_mask': [1, 1, 1, 1]}
But I do see it with other uploaded Llama tokenizers:
>>> tokenizer = AutoTokenizer.from_pretrained("oobabooga/llama-tokenizer", legacy=False)
>>> tokenizer("hello</s>")
{'input_ids': [1, 22172, 829, 29879, 29958], 'attention_mask': [1, 1, 1, 1, 1]}
>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}
Looking at the configs, they are not identical. Our code behaves correctly when we use the official tokenizer, and adding a space actually changes the tokenization with the official tokenizer:
>>> tokenizer("hello </s>")
{'input_ids': [1, 22172, 29871, 2], 'attention_mask': [1, 1, 1, 1]}
As such, I don't really want to change this, but this is a useful find. Perhaps you can edit the script locally to match your use case, if you need to use this tokenizer?
Hi, thanks so much for the response and the detailed investigation! I will update the script locally.
Hi, @hamishivi
I found a compatible workaround: setting add_eos_token=True when initializing the tokenizer.
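For reference, a minimal sketch of what I mean (using the official checkpoint from the examples above; the ids shown are what I'd expect based on those examples and may differ for other tokenizers):
>>> from transformers import AutoTokenizer
>>> # add_eos_token=True makes the tokenizer append the eos id itself,
>>> # so we no longer need to concatenate the "</s>" string into the text
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_eos_token=True)
>>> tokenizer("hello")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}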
Hi @hamishivi, sorry to bother you again.
Could you please also check this issue? https://github.com/huggingface/transformers/issues/29375
Can you replicate this behaviour on your side?
Unrelated, since Tulu did not use the fast tokenizer.
Hi, team!
Thanks so much for the great repo! I want to report a problem with how the eos token is added in finetune.py:
https://github.com/allenai/open-instruct/blob/763206f5b8fe340c7833c438113f69ba7bca8886/open_instruct/finetune.py#L277
https://github.com/allenai/open-instruct/blob/763206f5b8fe340c7833c438113f69ba7bca8886/open_instruct/finetune.py#L311
Here is the reason why this might cause a problem:
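Appending the eos token as a string before tokenizing does not always map back to the eos id. A minimal sketch with the oobabooga/llama-tokenizer checkpoint (the first output is the one from the comparison above; the second is what I would expect with add_eos_token=True):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("oobabooga/llama-tokenizer", legacy=False)
>>> # concatenating the eos string into the text (what the linked lines appear to do)
>>> # can split "</s>" into ordinary tokens instead of producing the eos id 2
>>> tokenizer("hello" + tokenizer.eos_token)
{'input_ids': [1, 22172, 829, 29879, 29958], 'attention_mask': [1, 1, 1, 1, 1]}
>>> # letting the tokenizer append the eos id itself avoids this
>>> eos_tokenizer = AutoTokenizer.from_pretrained("oobabooga/llama-tokenizer", legacy=False, add_eos_token=True)
>>> eos_tokenizer("hello")
{'input_ids': [1, 22172, 2], 'attention_mask': [1, 1, 1]}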