Closed: y0mingzhang closed this issue 4 months ago
I figured out what's wrong. It looks like the OLMo training data are actually tokenized with allenai/gpt-neox-olmo-dolma-v1_5, despite the official configs pointing to allenai/eleuther-ai-gpt-neox-20b-pii-special.
AFAIK, the two tokenizers differ in only two token mappings: one tokenizes "|||IP_ADDRESS|||" as 0 and "<|endoftext|>" as 50279, and the other swaps the two ids. Perhaps it makes sense to just update the official configs if the dolma tokenizer is the right one to use.
https://github.com/allenai/OLMo/blob/5789cfe32390a0e80417e98285647cb8b41029ae/scripts/prepare_tulu_data.py#L122
According to the tokenizer allenai/eleuther-ai-gpt-neox-20b-pii-special, the eos token should have id 0 instead of 50279, which instead points to the PII token |||IP_ADDRESS|||.
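The mismatch described above can be sketched in a few lines. This is an illustrative stand-in, not loaded from the Hugging Face hub: the two dicts encode the special-token ids as reported in this thread, and `eos_mismatch` is a hypothetical helper showing why a config that hardcodes eos id 50279 only agrees with one of the two tokenizers. (To check the real tokenizers, one could load each with `transformers.AutoTokenizer.from_pretrained` and call `convert_tokens_to_ids("<|endoftext|>")`.)

```python
# Special-token ids as reported in this thread (illustrative stand-ins,
# not loaded from the hub).
PII_SPECIAL = {"<|endoftext|>": 0, "|||IP_ADDRESS|||": 50279}  # eleuther-ai-gpt-neox-20b-pii-special
DOLMA = {"<|endoftext|>": 50279, "|||IP_ADDRESS|||": 0}        # gpt-neox-olmo-dolma-v1_5


def eos_mismatch(expected_eos_id: int, special_tokens: dict) -> bool:
    """Return True if the tokenizer's <|endoftext|> id differs from
    what a config expects."""
    return special_tokens["<|endoftext|>"] != expected_eos_id


# A config expecting eos id 50279 is consistent with the dolma tokenizer,
# but under pii-special that id is the |||IP_ADDRESS||| token instead:
print(eos_mismatch(50279, DOLMA))        # False: ids agree
print(eos_mismatch(50279, PII_SPECIAL))  # True: pii-special maps <|endoftext|> to 0
```

Because the two vocabularies only swap these two ids, everything except these two special tokens round-trips identically through either tokenizer, which is why the discrepancy is easy to miss.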