Have you set eos_token_id=tokenizer.convert_tokens_to_ids("</s>")?
If you don't specify an eos_id, it will continue generating until it reaches max_tokens.
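For example, a minimal sketch of passing an explicit eos id to generate() (the checkpoint path and prompt here are placeholders, not taken from this thread):
# Sketch: resolve the eos id from the tokenizer and hand it to generate()
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "Yi-6B-Chat"  # assumed local path to a Yi checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("1+1=", return_tensors="pt")
eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")  # Yi's eos token; substitute whatever your tokenizer uses
outputs = model.generate(**inputs, max_new_tokens=128, eos_token_id=eos_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))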
https://github.com/01-ai/Yi/issues/112#issue-1989966201
First I thought it was my conv template problem, but even after switching to a fastchat-like conv template with <|endoftext|> as the separator, the output still contains a lot of </s>.
I am confused now: does the data used to train the base model contain a lot of </s>?
It shouldn't happen that the SFT model outputs so many </s>.
I can set </s> as a bad word or banned token id, but that shouldn't be necessary, since the eos token in the tokenizer is <|endoftext|>.
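(For reference, a rough sketch of that bad-words workaround; it assumes model, tokenizer and input_ids are already prepared as in the scripts below, and that the tokenizer really maps "</s>" to an id.)
# Sketch only: ban </s> during generation via bad_words_ids
bad_id = tokenizer.convert_tokens_to_ids("</s>")
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    bad_words_ids=[[bad_id]],  # each entry is a sequence of ids to block
)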
Hi @lucasjinreal, could you first confirm you've done as @loofahcus said above?
I think this is not about setting the eos id.
I use a conv template like fschat's, and the end of each conversation is <|endoftext|>, which is the eos_token used in the tokenizer, but the output still generates a lot of </s>.
My SFT dataset doesn't contain any data like this, so this is not about the eos token; I think the model has just collapsed. I have never had a model keep generating without producing <|endoftext|> first.
Even though I can use </s> as a stop word, this is still a problem, since the model cannot generate <|endoftext|>, which is supposed to be in the training data.
Could you print the generated token id?
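(Something like the following would show them; a sketch that assumes the model, tokenizer and input_ids from the script below.)
# Sketch: inspect the raw generated ids instead of only the decoded string
outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
new_ids = outputs[0][input_ids.shape[-1]:]  # drop the prompt part
print(new_ids.tolist())  # raw token ids
print(tokenizer.convert_ids_to_tokens(new_ids.tolist()))  # their token strings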
Same question here:
I added tokenizer.eos_token_id to the end of the response input ids, but I still got a </s> token after the answer. Why is tokenizer.eos_token not taking effect in the SFT procedure?
The training data format used for SFT is:
<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|endoftext|>
the inference script is:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# (model version: commit id: 4af5d306ab)
generate_args = {
    'do_sample': True,
    'temperature': 0.95,
    'top_p': 0.7,
    'top_k': 30,
    'num_beams': 1,
    'max_length': 512,
    'max_new_tokens': 512,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0,
    'eos_token_id': [64001],
    'pad_token_id': 0
}
# curr_batch_input_ids: the encoded (padded) prompt batch, prepared elsewhere
logits = model.generate(
    input_ids=curr_batch_input_ids, **generate_args
)
answer = tokenizer.decode(logits[0], skip_special_tokens=False)
print(answer)
... some padding tokens... <|im_start|> user
1+2=<|im_end|>
<|im_start|> assistant
3</s> ... some random tokens ...
the expected output is:
... some padding tokens... <|im_start|> user
1+2=<|im_end|>
<|im_start|> assistant
3<|endoftext|>
Setting eos_token_id to [64001, 2] could solve this problem, but I'm wondering why <|endoftext|> is not taking effect (why did the SFT model generate </s> instead of it)?
Thanks.
Hi @shihanmax,
Could you first confirm your original input is correctly encoded?
<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|endoftext|>
If the tokenizer is correctly configured, <|im_start|>, <|im_end|> and <|endoftext|> should be encoded as special tokens.
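(One way to check this, as a sketch, assuming tk is a Yi tokenizer loaded with use_fast=False:)
# Sketch: verify the chat markers are encoded as single special-token ids,
# not split into sub-word pieces
text = "<|im_start|>user\n1+1=<|im_end|>\n<|im_start|>assistant\n2<|endoftext|>"
ids = tk.encode(text)
print(ids)
print(tk.convert_ids_to_tokens(ids))  # each marker should appear as one token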
Hi, thank you for the information. Actually I forgot to pass use_fast=False when initializing the tokenizer, which caused this problem.
I compared the fast and non-fast modes of the tokenizer:
>>> from transformers import AutoTokenizer
>>> model_path = "" # (model version: commit id: 4af5d306ab)
>>> tk = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
>>> tk_fast = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=True)
>>> labels = [33228, 2]
>>> print(tk.decode(labels, skip_special_tokens=False))
hello<|endoftext|>
>>> print(tk_fast.decode(labels, skip_special_tokens=False))
hello</s>
Is this behavior reasonable?
<|im_start|>user 1+1=<|im_end|> <|im_start|>assistant 2<|endoftext|>
What is the correct eos token to use during finetuning, actually?
<|endoftext|> (64001) or <|im_end|> (7)?
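(The mapping can be checked directly on the checkpoint; a sketch assuming a local Yi-6B-Chat path as used later in this thread:)
# Sketch: print the ids the chat/eos tokens map to in your tokenizer
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("Yi-6B-Chat", trust_remote_code=True, use_fast=False)
for tok in ["<|im_start|>", "<|im_end|>", "<|endoftext|>"]:
    print(tok, tk.convert_tokens_to_ids(tok))
print("eos_token:", tk.eos_token, tk.eos_token_id)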
@shihanmax this is weird?
>>> print(tk.decode(labels, skip_special_tokens=False))
hello<|endoftext|>
>>> print(tk_fast.decode(labels, skip_special_tokens=False))
hello</s>
<|im_start|>user 1+1=<|im_end|> <|im_start|>assistant 2<|endoftext|>
What is the correct eos token to use during finetuning, actually?
<|endoftext|> (64001) or <|im_end|> (7)?
I think <|im_end|> shouldn't be the eos token (it would cause generation to stop early), and it is optional at the end of the text.
<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|im_end|><|endoftext|>
@shihanmax on the contrary, when you run inference you should get only one <|im_end|>, the one after the assistant turn.
@shihanmax this is weird?
>>> print(tk.decode(labels, skip_special_tokens=False))
hello<|endoftext|>
>>> print(tk_fast.decode(labels, skip_special_tokens=False))
hello</s>
It seems a bit confusing...
from transformers import AutoTokenizer
tk = AutoTokenizer.from_pretrained("Yi-6B-Chat", trust_remote_code=True, use_fast=False)
tk_fast = AutoTokenizer.from_pretrained("Yi-6B-Chat", trust_remote_code=True, use_fast=True)
print(tk.convert_ids_to_tokens([2])) # ['<|endoftext|>']
print(tk_fast.convert_ids_to_tokens([2])) # ['</s>']
Yes, this is very weird. use_fast=False is always preferred at the moment.
For example: the questions are answered OK, but there are too many </s>.
I know there are some ways to ignore it, but normally this shouldn't happen. Why? I am finetuning with full params on Yi-6B-200k.