01-ai / Yi

A series of large language models trained from scratch by developers @01-ai
https://01.ai
Apache License 2.0

Generated a lot of </s> when doing SFT on the 200K model #112

Closed · lucasjinreal closed this 11 months ago

lucasjinreal commented 11 months ago

At first I thought it was a problem with my conversation template, but even after switching to a fastchat-style template with <|endoftext|> as the separator,

the result still contains a lot of </s>.

I am confused now: did the data used to train the base model contain a lot of these tokens?

SFT output shouldn't contain so many of them.

I could suppress it as a bad word or banned token id, but that shouldn't be necessary, since the tokenizer's end-of-text token is <|endoftext|>.
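As a workaround (not a fix for whatever is in the SFT data), the stray marker strings can be stripped from the decoded output, or the offending token id can be suppressed at generation time via Hugging Face's `bad_words_ids` / `suppress_tokens` generation parameters. A minimal post-processing sketch; the exact literal strings are assumptions based on this thread:

```python
import re

# Special-token strings this thread reports leaking into generated text
# (assumed literals; adjust to match your tokenizer's actual special tokens).
STRAY_TOKENS = ["</s>", "<|endoftext|>"]

def strip_stray_tokens(text: str) -> str:
    """Remove leaked special-token strings from decoded model output."""
    pattern = "|".join(re.escape(tok) for tok in STRAY_TOKENS)
    return re.sub(pattern, "", text).strip()

print(strip_stray_tokens("Hello world</s></s><|endoftext|>"))  # -> Hello world
```

This only masks the symptom; if the base-model data really contains many such tokens, retraining or banning the token id during generation would be the more direct route.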

ZhaoFancy commented 11 months ago

We can continue the discussion at https://github.com/01-ai/Yi/issues/80