allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Issue with tokenizer wrapper #644

Open davidbrandfonbrener opened 2 months ago

davidbrandfonbrener commented 2 months ago

❓ The question

The tokenizer wrapper causes unintended behavior when the underlying tokenizer has a BOS token (as the Llama tokenizers do). In particular, the call to the base_tokenizer's encode function will prepend BOS tokens even when the wrapper is called with add_special_tokens=False.

The root cause is that the call here leaves the base_tokenizer's default of add_special_tokens=True in place, so the flag passed to the wrapper is never forwarded.
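
For illustration, here is a minimal reproduction, assuming the underlying tokenizer is a `tokenizers.Tokenizer` whose post-processor prepends BOS; the public `hf-internal-testing/llama-tokenizer` checkpoint is used purely as a stand-in for the Llama tokenizers:

```python
from tokenizers import Tokenizer

# Stand-in for a Llama-style tokenizer that prepends BOS via its
# post-processor; any tokenizer configured this way behaves the same.
tok = Tokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

# Default call: add_special_tokens=True, so BOS is prepended.
with_default = tok.encode("hello world").ids

# Explicitly disabling special tokens drops the BOS.
without_special = tok.encode("hello world", add_special_tokens=False).ids

print(with_default)     # e.g. [1, 22172, 3186]  (1 is the Llama BOS id)
print(without_special)  # e.g. [22172, 3186]
```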

The narrow fix should be fairly easy (see the sketch below), but properly handling tokenizers with BOS tokens would require broader changes to the wrapper.
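
A minimal sketch of the narrow fix, assuming the wrapper exposes an `encode` method and stores the underlying `tokenizers.Tokenizer` as `base_tokenizer` (names are illustrative, not the exact OLMo code):

```python
def encode(self, text: str, add_special_tokens: bool = True) -> list[int]:
    # Forward the flag instead of relying on the base tokenizer's
    # default of add_special_tokens=True, which silently prepends BOS
    # for tokenizers configured with a BOS post-processor.
    encoding = self.base_tokenizer.encode(text, add_special_tokens=add_special_tokens)
    return encoding.ids
```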

This also raises the question of why the wrapper is needed in the first place instead of using the Hugging Face library directly. I wanted to better understand the motivation before making changes.