allenai/OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Default eos_token_id in `scripts/prepare_tulu_data.py` #597

Closed: y0mingzhang closed this issue 4 months ago

y0mingzhang commented 4 months ago

https://github.com/allenai/OLMo/blob/5789cfe32390a0e80417e98285647cb8b41029ae/scripts/prepare_tulu_data.py#L122

According to the tokenizer `allenai/eleuther-ai-gpt-neox-20b-pii-special`, the EOS token should have id 0, not 50279; in that tokenizer, id 50279 points to the PII token `|||IP_ADDRESS|||`.
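
For reference, a minimal way to check this with Hugging Face `transformers`, assuming the tokenizer is published on the Hub under that name (the expected outputs are the values claimed above):

```python
# Minimal sketch to verify the EOS id claim; assumes the tokenizer is
# available on the Hugging Face Hub under this name.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/eleuther-ai-gpt-neox-20b-pii-special")

# The EOS token id according to the tokenizer's own config.
print(tok.eos_token_id)                   # expected: 0

# What id 50279 actually maps to in this tokenizer.
print(tok.convert_ids_to_tokens(50279))   # expected: |||IP_ADDRESS|||
```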

y0mingzhang commented 4 months ago

I figured out what's wrong. It looks like the OLMo training data is actually tokenized with `allenai/gpt-neox-olmo-dolma-v1_5`, despite the official configs pointing to `allenai/eleuther-ai-gpt-neox-20b-pii-special`.

https://github.com/allenai/OLMo/blob/ae84d479fa5775b1935b50b2120e0b514313ce18/configs/official/OLMo-1B.yaml#L55
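
A quick side-by-side check along these lines, assuming both tokenizers are published on the Hub under these names, makes the mismatch visible:

```python
# Sketch comparing how each tokenizer defines its EOS token and what it
# stores at id 50279; assumes both are on the Hugging Face Hub.
from transformers import AutoTokenizer

names = [
    "allenai/eleuther-ai-gpt-neox-20b-pii-special",
    "allenai/gpt-neox-olmo-dolma-v1_5",
]
for name in names:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  eos_token_id:", tok.eos_token_id)
    print("  token at 50279:", tok.convert_ids_to_tokens(50279))
```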

AFAIK, the two tokenizers differ in only two ways:

Perhaps it makes sense to just update the official configs if dolma is the right tokenizer to use.
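
If dolma is indeed the tokenizer the training data was prepared with, the change might look roughly like this (the key names and structure are assumptions based on the linked YAML, not a verified diff):

```yaml
# Hypothetical sketch of the update to configs/official/OLMo-1B.yaml;
# the exact keys and identifier value are assumptions.
tokenizer:
  identifier: allenai/gpt-neox-olmo-dolma-v1_5
  truncate_direction: right
```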