Closed · l-k-11235 closed this issue 1 week ago
Yes, there is an issue here: https://github.com/eole-nlp/eole/blob/main/eole/bin/convert/convert_HF.py#L203 — this specials table is for Llama 2; we need to make it different between Llama 2 and Llama 3.
In fact, you should not have the issue if you do not have a `tokenizer.model` file in the folder. If there is NO such file, then `convert_HF` will leverage `tokenizer.json` and will not add these specials. Please check your folder and try again.
In which folder? The `vocab.json` of the checkpoint obtained with `convert_HF` does not contain the unknown token; the `vocab.json` of the fine-tuned model does. There is a `bpe.model` in the converted checkpoint folder.
Sorry, in the case of Llama 3 I think you need to set `default_specials=[]` in the config, because of this: https://github.com/eole-nlp/eole/blob/main/eole/config/data.py#L29
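For reference, the relevant training-config fragment would look like this (a minimal sketch; the field name comes from the `data.py` default linked above, the rest of the config is elided):

```yaml
# Keep the pretrained Llama 3 vocabulary untouched:
# do not prepend the default special tokens (<unk>, <s>, ...).
default_specials: []
```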
We have the mapping for eos, bos and pad tokens in convert_HF.py
https://github.com/eole-nlp/eole/blob/c74e495d9bf55f5c7b2b2968097c5c1e813a4d5c/eole/bin/convert/convert_HF.py#L892-L909
but what about the unknown token? If I put `default_specials: []` in the training config, the model will be trained without the unknown token, right? However, the data could contain "new" symbols or emojis, for instance.
If we want to handle unknown tokens without changing the vocabulary, shouldn't we use a "reserved token"?
When the vocabulary does not contain the unknown token, unknown tokens are mapped to the first token.
https://github.com/eole-nlp/eole/blob/c74e495d9bf55f5c7b2b2968097c5c1e813a4d5c/eole/inputters/inputter.py#L51
For Llama 3 it corresponds to `'!'`, which is not so bad. Unknown tokens are supposed to be rare, so we can put `default_specials: []` in the training config to prevent inserting new tokens at the beginning of the vocabulary.
However, using something other than `default_specials: []` with pretrained models will always create an offset in the vocabulary ids. We should probably avoid doing this when fine-tuning with `train_from` without updating the vocabulary.
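The fallback described above (out-of-vocabulary tokens silently taking the id of the first vocabulary entry) can be sketched with a plain `defaultdict`, which is essentially what the linked `inputter.py` line relies on. The token list here is hypothetical, just to show the mechanism:

```python
from collections import defaultdict

# Hypothetical vocabulary: the first entry ('!') has id 0.
tokens = ["!", "\"", "#", "hello"]

# Map any unseen token to id 0, i.e. to the first token.
stoi = defaultdict(lambda: 0)
stoi.update({tok: i for i, tok in enumerate(tokens)})

print(stoi["hello"])  # in-vocabulary: 3
print(stoi["🦙"])     # out-of-vocabulary: 0, decoded back as "!"
```

This is why, with Llama 3 and no `<unk>` entry, rare unknown symbols end up rendered as `'!'` rather than crashing.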
When I fine-tune a Llama 3 model and save a checkpoint, it also saves a `vocab.json` file which is slightly different from the `vocab.json` of the base model, because it contains the unknown token as the first token. This completely breaks the predictions at inference time, even at step 0 of the fine-tuning. I manually replaced the merged model's `vocab.json` file with that of the main model, and it fixed the problem.
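A quick way to confirm this failure mode is to check whether the merged vocabulary is just the base vocabulary with tokens inserted at the front. This is a hedged sketch (it assumes you have already loaded both vocabularies as plain lists of token strings; adapt the loading to your actual `vocab.json` format):

```python
def detect_offset(base_vocab, merged_vocab):
    """Return the number of tokens inserted before the base vocabulary
    in `merged_vocab` (the id-offset failure mode described above),
    or None if the merged vocab is not a front-padded copy of the base."""
    for shift in range(len(merged_vocab) - len(base_vocab) + 1):
        if merged_vocab[shift:shift + len(base_vocab)] == base_vocab:
            return shift
    return None

# e.g. an <unk> prepended to the base vocabulary yields an offset of 1,
# shifting every pretrained token id by one.
base = ["!", "\"", "#"]
merged = ["<unk>"] + base
print(detect_offset(base, merged))  # 1
```

A non-zero offset means every pretrained embedding row is looked up with the wrong id, which matches the broken predictions seen at step 0.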