lucasjinreal closed this issue 11 months ago
At first I thought it was a problem with my conversation template, but even after switching to a FastChat-style conversation template with <|endoftext|> as the separator, the output still contains a lot of <|endoftext|> tokens.
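For reference, a FastChat-style template of this kind joins the conversation turns with the separator token. Below is a minimal sketch of how such a prompt might be assembled; the role names and helper function are assumptions for illustration, not the exact template used here:

```python
# A minimal sketch of a FastChat-style conversation template that joins
# turns with <|endoftext|> as the separator. Role names and structure
# here are assumptions for illustration, not the exact template in use.
SEP = "<|endoftext|>"

def build_prompt(turns):
    # turns: list of (role, message) pairs,
    # e.g. [("Human", "hi"), ("Assistant", "hello")]
    parts = [f"{role}: {msg}" for role, msg in turns]
    # A trailing separator marks the end of the last completed turn.
    return SEP.join(parts) + SEP

print(build_prompt([("Human", "hi"), ("Assistant", "hello")]))
```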
I am confused now: does the training data for the base model contain a lot of such tokens?
This shouldn't happen after SFT on a large amount of output data.
The token could be suppressed by setting it as a bad-words token ID at generation time, but that shouldn't be necessary, since it is the tokenizer's own <|endoftext|> token.
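As a workaround, here is a sketch of the suppression mentioned above, assuming a Hugging Face transformers setup; the checkpoint name and prompt are placeholders, not necessarily the model from this issue:

```python
# A minimal sketch, assuming a Hugging Face transformers setup. Two common
# ways to deal with stray <|endoftext|> tokens: stop generation when the
# token appears (eos_token_id), or forbid it entirely (bad_words_ids).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-6B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

eot_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
inputs = tokenizer("Human: hi\nAssistant:", return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=eot_id,          # stop as soon as <|endoftext|> is emitted
    # bad_words_ids=[[eot_id]],   # alternatively, ban the token outright
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```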
We can continue the discussion at https://github.com/01-ai/Yi/issues/80