eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

The unknown token in the vocabulary of the finetuned checkpoint breaks predictions at inference time #59

Closed: l-k-11235 closed 1 week ago

l-k-11235 commented 2 weeks ago

When I fine-tune a llama3 model and save a checkpoint, it also saves a vocab.json file that differs slightly from the base model's vocab.json: it contains the unknown token as the first token. This completely breaks predictions at inference time, even at step 0 of the fine-tuning. I manually replaced the merged model's vocab.json with the base model's and that fixed the problem.
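
For illustration, a minimal sketch (hypothetical paths, and assuming vocab.json holds either a token list in id order or a token-to-id mapping) that surfaces the offset between the two files:

```python
import json

def load_tokens(path):
    """Return the vocabulary as a list of tokens in id order."""
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict):  # token -> id mapping
        return [tok for tok, _ in sorted(data.items(), key=lambda kv: kv[1])]
    return data  # already a list of tokens

base = load_tokens("llama3-converted/vocab.json")   # from convert_HF
tuned = load_tokens("llama3-finetuned/vocab.json")  # saved with the checkpoint

print(base[:5])   # e.g. starts with '!'
print(tuned[:5])  # starts with the unknown token
```

If the fine-tuned vocabulary has an extra token at position 0, every subsequent id is shifted by one relative to the converted base model, which matches the broken predictions described above.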

vince62s commented 2 weeks ago

Yes, there is an issue here: https://github.com/eole-nlp/eole/blob/main/eole/bin/convert/convert_HF.py#L203 This specials table is for llama2; we need to handle llama2 and llama3 differently.

vince62s commented 2 weeks ago

In fact, you should not have the issue if there is no "tokenizer.model" file in the folder. If such a file is NOT there, convert_HF will leverage tokenizer.json and will not add these specials. Please check your folder and try again.
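
For reference, a quick way to check which tokenizer files sit in the HF model folder (the path is illustrative):

```python
from pathlib import Path

hf_dir = Path("path/to/hf-llama3")  # the folder passed to convert_HF
for name in ("tokenizer.model", "tokenizer.json"):
    status = "found" if (hf_dir / name).exists() else "missing"
    print(f"{name}: {status}")
```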

l-k-11235 commented 2 weeks ago

In which folder? The vocab.json of the checkpoint obtained with convert_HF does not contain the unknown token, but the vocab.json of the fine-tuned model does. There is a bpe.model in the converted checkpoint folder.

vince62s commented 2 weeks ago

Sorry, in the case of llama3 I think you need to set default_specials=[] in the config, because of this: https://github.com/eole-nlp/eole/blob/main/eole/config/data.py#L29

l-k-11235 commented 2 weeks ago

We have the mapping for the eos, bos and pad tokens in convert_HF.py https://github.com/eole-nlp/eole/blob/c74e495d9bf55f5c7b2b2968097c5c1e813a4d5c/eole/bin/convert/convert_HF.py#L892-L909 but what about the unknown token? If I put default_specials: [] in the training config, the model will be trained without the unknown token, right? However, the data could contain "new" symbols or emojis, for instance.

l-k-11235 commented 2 weeks ago

If we want to handle unknown tokens without changing the vocabulary, shouldn't we use a "reserved token"?

l-k-11235 commented 2 weeks ago

When the vocabulary does not contain the unknown token, unknown tokens are mapped to the first token: https://github.com/eole-nlp/eole/blob/c74e495d9bf55f5c7b2b2968097c5c1e813a4d5c/eole/inputters/inputter.py#L51 For llama3 this corresponds to '!', which is not so bad. Unknown tokens are supposed to be rare, so we can put default_specials: [] in the training config to avoid inserting new tokens at the beginning of the vocabulary. However, using anything other than default_specials: [] with pretrained models will always create an offset in the vocabulary ids. We should probably avoid doing this when fine-tuning with train_from without updating the vocabulary.
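
As a rough sketch of that fallback (not the actual eole implementation), out-of-vocabulary tokens default to id 0 when no unknown token is registered:

```python
def numericalize(tokens, stoi, unk_token="<unk>"):
    # Fall back to id 0 when the unknown token is not in the vocabulary.
    fallback = stoi.get(unk_token, 0)
    return [stoi.get(tok, fallback) for tok in tokens]

stoi = {"!": 0, "hello": 1, "world": 2}  # toy llama3-like vocab without <unk>
print(numericalize(["hello", "🦙", "world"], stoi))  # [1, 0, 2] -> '🦙' maps to '!'
```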

vince62s commented 1 week ago

#63 should fix the issue of having to set default_specials=[]