Closed: y0mingzhang closed this issue 4 months ago
I figured out what's wrong. It looks like the OLMo training data are actually tokenized with allenai/gpt-neox-olmo-dolma-v1_5, despite the official configs pointing to allenai/eleuther-ai-gpt-neox-20b-pii-special.
AFAIK, the two tokenizers differ in only two token mappings: one tokenizes "|||IP_ADDRESS|||" as 0 and "<|endoftext|>" as 50279, and the other swaps the two ids. Perhaps it makes sense to just update the official configs if the dolma tokenizer is the right one to use.
https://github.com/allenai/OLMo/blob/5789cfe32390a0e80417e98285647cb8b41029ae/scripts/prepare_tulu_data.py#L122
According to the tokenizer allenai/eleuther-ai-gpt-neox-20b-pii-special, the eos token should have id 0 instead of 50279, which instead points to the PII token |||IP_ADDRESS|||.
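The mismatch described above can be sketched in a few lines. This is an illustrative stand-in, not loaded from the Hugging Face hub: the two dicts encode the special-token ids as reported in this thread, and `eos_mismatch` is a hypothetical helper showing why a config that hardcodes eos id 50279 only agrees with one of the two tokenizers. (To check the real tokenizers, one could load each with `transformers.AutoTokenizer.from_pretrained` and call `convert_tokens_to_ids("<|endoftext|>")`.)

```python
# Special-token ids as reported in this thread (illustrative stand-ins,
# not loaded from the hub).
PII_SPECIAL = {"<|endoftext|>": 0, "|||IP_ADDRESS|||": 50279}  # eleuther-ai-gpt-neox-20b-pii-special
DOLMA = {"<|endoftext|>": 50279, "|||IP_ADDRESS|||": 0}        # gpt-neox-olmo-dolma-v1_5


def eos_mismatch(expected_eos_id: int, special_tokens: dict) -> bool:
    """Return True if the tokenizer's <|endoftext|> id differs from
    what a config expects."""
    return special_tokens["<|endoftext|>"] != expected_eos_id


# A config expecting eos id 50279 is consistent with the dolma tokenizer,
# but under pii-special that id is the |||IP_ADDRESS||| token instead:
print(eos_mismatch(50279, DOLMA))        # False: ids agree
print(eos_mismatch(50279, PII_SPECIAL))  # True: pii-special maps <|endoftext|> to 0
```

Because the two vocabularies only swap these two ids, everything except these two special tokens round-trips identically through either tokenizer, which is why the discrepancy is easy to miss.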