01-ai / Yi

A series of large language models trained from scratch by developers @01-ai
https://01.ai
Apache License 2.0

Generated a lot of </s> when doing SFT on the 200K model #112

Closed · lucasjinreal closed this 11 months ago

lucasjinreal commented 11 months ago

At first I thought it was a problem with my conversation template, but even after switching to a fastchat-style template with <|endoftext|> as the separator,

the result still contains a lot of </s>.

I am confused now: did the data used to train the base model contain a lot of these tokens?

SFT output shouldn't contain so many of them.

I could suppress it as a bad word or banned token id, but that shouldn't be necessary, since the tokenizer's end-of-text token is <|endoftext|>.
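As a workaround (not a fix for whatever is in the SFT data), the stray marker strings can be stripped from the decoded output, or the offending token id can be suppressed at generation time via Hugging Face's `bad_words_ids` / `suppress_tokens` generation parameters. A minimal post-processing sketch; the exact literal strings are assumptions based on this thread:

```python
import re

# Special-token strings this thread reports leaking into generated text
# (assumed literals; adjust to match your tokenizer's actual special tokens).
STRAY_TOKENS = ["</s>", "<|endoftext|>"]

def strip_stray_tokens(text: str) -> str:
    """Remove leaked special-token strings from decoded model output."""
    pattern = "|".join(re.escape(tok) for tok in STRAY_TOKENS)
    return re.sub(pattern, "", text).strip()

print(strip_stray_tokens("Hello world</s></s><|endoftext|>"))  # -> Hello world
```

This only masks the symptom; if the base-model data really contains many such tokens, retraining or banning the token id during generation would be the more direct route.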

ZhaoFancy commented 11 months ago

We can continue the discussion at https://github.com/01-ai/Yi/issues/80