EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing? #115

Closed. RylanSchaeffer closed this issue 1 year ago.

RylanSchaeffer commented 1 year ago

I apologize if this has been asked before, but I couldn't find the answer on GitHub or Hugging Face. I also asked on Discord, and I will cross-post the answer to whichever venue responds slower.

For the Pythia models, what is the relationship between the tokenizers across different sizes, different training steps, and different data preprocessing (non-deduplicated vs. deduplicated)?

The demo shows:

from transformers import AutoTokenizer

# Load the tokenizer for a specific model size, checkpoint revision, and data variant.
tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

This suggests to me that the Pythia tokenizers are a function of all three: size (70M), step (3000), and data preprocessing (deduplicated).

But this doesn't make sense to me. Rather, I would guess that the answer is either:

  1. There is one Pythia tokenizer, shared by all sizes, steps and data preprocessing

  2. There are two Pythia tokenizers, one for the deduplicated data and one for the non-deduplicated data

Could someone please clarify?

RylanSchaeffer commented 1 year ago

Answer from Stella on Discord:

There is one Pythia tokenizer, and it's the same tokenizer used by GPT-NeoX-20B, MPT, and a bunch of other models too.

It's generally considered best practice to write the code like that, because then you develop habits that are invariant to the tokenizer and you don't need to keep track of which models use the GPT-2 tokenizer, which use the GPT-NeoX tokenizer, etc.
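
One way to sanity-check the shared-tokenizer claim is a quick sketch along these lines. This is not from the thread; it assumes the usual Hub repo names (EleutherAI/pythia-70m-deduped, EleutherAI/pythia-410m, EleutherAI/gpt-neox-20b) and the step3000 revision from the demo above, loads the tokenizer from each checkpoint, and checks that the vocabularies match.

from transformers import AutoTokenizer

checkpoints = [
    ("EleutherAI/pythia-70m-deduped", "step3000"),  # small, deduped, early checkpoint
    ("EleutherAI/pythia-410m", "main"),             # larger, non-deduped, final checkpoint
    ("EleutherAI/gpt-neox-20b", "main"),            # GPT-NeoX-20B itself
]

# Load each tokenizer and compare vocabularies; if there really is a single
# shared tokenizer, every vocabulary should be identical.
vocabs = [
    AutoTokenizer.from_pretrained(name, revision=rev).get_vocab()
    for name, rev in checkpoints
]
print(all(v == vocabs[0] for v in vocabs[1:]))  # expected: True if the claim holds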