EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing? #115

Closed. RylanSchaeffer closed this issue 1 year ago.

RylanSchaeffer commented 1 year ago

I apologize if this has been asked before, but I couldn't find the answer on GitHub or Hugging Face. I also asked on Discord, and I will cross-post the answer to whichever venue responds slower.

For the Pythia models, what is the relationship between the tokenizers across different sizes, different training steps, and different data preprocessing (non-deduplicated vs. deduplicated)?

The demo shows:

from transformers import AutoTokenizer

# Load the tokenizer for a specific model size, checkpoint revision, and data variant.
tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

This suggests to me that the Pythia tokenizers are a function of all three: size (70M), step (3000), and data preprocessing (deduplicated).

But this doesn't make sense to me. Rather, I would guess that the answer is either:

  1. There is one Pythia tokenizer, shared by all sizes, steps and data preprocessing

  2. There are two Pythia tokenizers, one for the deduplicated data and one for the non-deduplicated data

Could someone please clarify?

RylanSchaeffer commented 1 year ago

Answer from Stella on Discord:

There is one Pythia tokenizer, and it's the same tokenizer used by GPT-NeoX-20B, MPT, and a bunch of other models too.

It's generally considered best practice to write the code like that, because then you develop habits that are invariant to the tokenizer and you don't need to keep track of which models use the GPT-2 tokenizer, which use the GPT-NeoX tokenizer, etc.
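
One way to sanity-check the shared-tokenizer claim is a quick sketch along these lines. This is not from the thread; it assumes the usual Hub repo names (EleutherAI/pythia-70m-deduped, EleutherAI/pythia-410m, EleutherAI/gpt-neox-20b) and the step3000 revision from the demo above, loads the tokenizer from each checkpoint, and checks that the vocabularies match.

from transformers import AutoTokenizer

checkpoints = [
    ("EleutherAI/pythia-70m-deduped", "step3000"),  # small, deduped, early checkpoint
    ("EleutherAI/pythia-410m", "main"),             # larger, non-deduped, final checkpoint
    ("EleutherAI/gpt-neox-20b", "main"),            # GPT-NeoX-20B itself
]

# Load each tokenizer and compare vocabularies; if there really is a single
# shared tokenizer, every vocabulary should be identical.
vocabs = [
    AutoTokenizer.from_pretrained(name, revision=rev).get_vocab()
    for name, rev in checkpoints
]
print(all(v == vocabs[0] for v in vocabs[1:]))  # expected: True if the claim holds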