01-ai / Yi

A series of large language models trained from scratch by developers @01-ai
https://01.ai
Apache License 2.0

SFT got a lot of </s> #80

Closed: lucasjinreal closed this 11 months ago

lucasjinreal commented 11 months ago

For example:

{
    "Q": "正方形上剪掉一个角还剩几个边?",
    "A": "正方形上剪掉一个角,剩下的图形是一个三角形。三角形的三个边分别与正方形的四个角相连,因此,剩下三条边。</s>\n\n---\n\n答案:3条边。</s></s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s>\n</s"
  }

The question is answered OK, but there are too many </s>.

I know there are ways to ignore it, but normally this shouldn't happen. Why? I am fine-tuning full params on Yi-6B-200k.

loofahcus commented 11 months ago

Have you set eos_token_id=tokenizer.convert_tokens_to_ids("</s>") ?

loofahcus commented 11 months ago

If you don't specify an eos token id, generation will continue until it reaches max_tokens.
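For reference, a minimal sketch of what that looks like (assuming a standard transformers setup; the checkpoint path and prompt are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your-finetuned-yi-6b"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("正方形上剪掉一个角还剩几个边?", return_tensors="pt")

# stop generation on the token your SFT data actually ends with
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))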

ZhaoFancy commented 11 months ago

https://github.com/01-ai/Yi/issues/112#issue-1989966201

At first I thought it was my conv template problem, but even after changing to a fastchat-like conv template with <|endoftext|> as the separator, the result still contains a lot of </s>.

I am confused now. Does the training data of the base model contain a lot of </s>? It shouldn't happen that SFT output contains so many </s>.

I can set </s> as a bad word or as a stop token id, but that shouldn't be necessary, since the eos token in the tokenizer is <|endoftext|>.

Hi @lucasjinreal , could you first confirm you've done as @loofahcus said above?

lucasjinreal commented 11 months ago

I think this is not about setting eos id.

  1. I use a conv template like fschat, with <|endoftext|> (the eos_token used in the tokenizer) marking the end of the conversation, but the output still generates a lot of </s>. My SFT dataset doesn't contain any data like this, so this is not about the eos token; I think the model has collapsed. I have never seen a model keep generating without emitting <|endoftext|> first.

  2. Even though I can use </s> as a stop word, this would still be a problem, since the model cannot generate <|endoftext|>, which is supposed to be in the training data.

ZhaoFancy commented 11 months ago

but the output generates a lot of </s>

Could you print the generated token ids?

shihanmax commented 10 months ago

Same question here:

I added tokenizer.eos_token_id to the end of the response input ids, but I still get a </s> token after the answer. Why is tokenizer.eos_token not taking effect in the SFT procedure?

The training data format used during SFT is:

<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|endoftext|>
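
Roughly, the preprocessing that appends the eos id looks like this (a simplified sketch, not my exact training code; the model id is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B", trust_remote_code=True)  # placeholder id

prompt = "<|im_start|>user\n1+1=<|im_end|>\n<|im_start|>assistant\n"
response = "2"

prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
response_ids = tokenizer(response, add_special_tokens=False).input_ids

# append <|endoftext|> (tokenizer.eos_token_id) to the end of the response
input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
# mask the prompt so the loss is only computed on the response and the eos token
labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]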

the inference script is:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

# (model version: commit id: 4af5d306ab)

generate_args = {
    'do_sample': True,
    'temperature': 0.95,
    'top_p': 0.7,
    'top_k': 30,
    'num_beams': 1,
    'max_length': 512,
    'max_new_tokens': 512,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0,
    'eos_token_id': [64001],  # 64001 is <|endoftext|>
    'pad_token_id': 0
}

# curr_batch_input_ids: the tokenized prompt batch (built elsewhere)
logits = model.generate(
    input_ids=curr_batch_input_ids, **generate_args
)

answer = tokenizer.decode(logits[0], skip_special_tokens=False)
print(answer)

... some padding tokens... <|im_start|> user
1+2=<|im_end|> 
<|im_start|> assistant
3</s> ... some random tokens ...

the expected output is:

... some padding tokens... <|im_start|> user
1+2=<|im_end|> 
<|im_start|> assistant
3<|endoftext|>

Setting eos_token_id to [64001, 2] could solve this problem, but I'm wondering why <|endoftext|> is not taking effect (why did the SFT model generate </s> instead of it).
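
For anyone hitting the same thing, the workaround applied to generate_args in the script above is simply:

# also accept token id 2, which the model is actually emitting (the fast tokenizer decodes it as </s>)
generate_args['eos_token_id'] = [64001, 2]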

thanks

findmyway commented 10 months ago

Hi @shihanmax,

Could you first confirm your original input is correctly encoded?

<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|endoftext|>

If the tokenizer is correctly configured, <|im_start|>, <|im_end|> and <|endoftext|> should be encoded as special tokens.
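
One quick way to check is to tokenize the template and look at the resulting tokens (a sketch; the model id is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-6B", trust_remote_code=True)  # placeholder id

text = "<|im_start|>user\n1+1=<|im_end|>\n<|im_start|>assistant\n2<|endoftext|>"
ids = tokenizer(text, add_special_tokens=False).input_ids

# each marker should come back as a single special token, not be split into pieces
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>", "<|endoftext|>"]))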

shihanmax commented 10 months ago

Hi, thank you for the information. It turns out I forgot to pass use_fast=False when initializing the tokenizer, which caused this problem.

I compared the fast and non-fast modes of the tokenizer:

>>> model_path = ""  # (model version: commit id: 4af5d306ab)
>>> tk = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
>>> tk_fast = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=True)

>>> labels = [33228, 2]

>>> print(tk.decode(labels, skip_special_tokens=False))
 hello<|endoftext|>
>>> print(tk_fast.decode(labels, skip_special_tokens=False))
hello</s>

Is this behavior reasonable?

annahung31 commented 10 months ago

<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|endoftext|>

What is the correct eos token to use during fine-tuning, actually: <|endoftext|> (64001) or <|im_end|> (7)?

lucasjinreal commented 10 months ago

@shihanmax isn't this weird?

print(tk.decode(labels, skip_special_tokens=False))       # hello<|endoftext|>
print(tk_fast.decode(labels, skip_special_tokens=False))  # hello</s>

shihanmax commented 10 months ago

I think <|im_end|> shouldn't be the eos token (it would cause generation to stop early), and it is optional at the end of the text:

<|im_start|>user
1+1=<|im_end|>
<|im_start|>assistant
2<|im_end|><|endoftext|>

lucasjinreal commented 10 months ago

@shihanmax on the contrary, at inference time you should get only one <|im_end|>, which comes after the assistant turn.

shihanmax commented 10 months ago

It seems a bit confusing:

from transformers import AutoTokenizer

tk = AutoTokenizer.from_pretrained("Yi-6B-Chat", trust_remote_code=True, use_fast=False)
tk_fast = AutoTokenizer.from_pretrained("Yi-6B-Chat", trust_remote_code=True, use_fast=True)

print(tk.convert_ids_to_tokens([2]))   # ['<|endoftext|>']
print(tk_fast.convert_ids_to_tokens([2]))  # ['</s>']
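
The mismatch also shows up in what each tokenizer reports as its eos token (worth checking on your own checkpoint; outputs omitted here):

# compare what each tokenizer thinks the eos token is
print(tk.eos_token, tk.eos_token_id)
print(tk_fast.eos_token, tk_fast.eos_token_id)
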
lucasjinreal commented 10 months ago

Yes, this is very weird.

findmyway commented 10 months ago

use_fast=False is always preferred at the moment.