eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

fix added tokens #101

Closed: vince62s closed 2 months ago

vince62s commented 2 months ago

When a model ships a sentencepiece tokenizer.model (mostly older llama1/2 and mistral), also read tokenizer.json and pick up the "added_tokens" entries that are not in the sentencepiece model vocab.
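
For illustration only, here is a minimal sketch of that lookup; it is not the code from this PR. It assumes tokenizer.json follows the Hugging Face layout, with a top-level "added_tokens" list whose entries carry "id" and "content" fields. The helper name `find_missing_added_tokens` and its arguments are hypothetical.

```python
import json
from pathlib import Path


def find_missing_added_tokens(tokenizer_json, sp_pieces):
    """Return (id, content) pairs listed under "added_tokens" in
    tokenizer.json whose content is absent from the sentencepiece vocab.

    Hypothetical helper: `sp_pieces` is any iterable of piece strings
    taken from the sentencepiece model.
    """
    data = json.loads(Path(tokenizer_json).read_text(encoding="utf-8"))
    known = set(sp_pieces)
    missing = []
    # Assumed layout: each entry is a dict with at least "id" and "content".
    for entry in data.get("added_tokens", []):
        content = entry["content"]
        if content not in known:
            missing.append((entry["id"], content))
    # Sort by token id so the result is deterministic.
    return sorted(missing)
```

The piece list could come from the sentencepiece Python package, for example `[sp.id_to_piece(i) for i in range(sp.get_piece_size())]` on a loaded `SentencePieceProcessor`. Entries such as `<unk>` that already exist in the model vocab are skipped, so only the genuinely extra tokens are returned.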