Issue: Tokenizer.from_checkpoint assumes that tokenizers are on HF or in a path relative to the current directory. This assumption doesn't hold when OLMo is installed from pip as ai2-olmo. More generally, we don't have a clear mechanism for putting data files in our repo.
Fix: This PR creates an olmo_data package, whose subdirectories correspond to various types of data (e.g. tokenizers and hf_datasets). Tokenizers are moved under olmo_data, and Tokenizer.from_checkpoint is updated to look at local paths first, then olmo_data, then HF Hub.
This change sets up the foundations for adding HF datasets to our repo, so that we don't have to make network calls during training runs.
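The new fallback order can be sketched roughly as below. This is an illustrative outline, not the actual implementation: the function name, the `tokenizers` subdirectory layout, and the return values are assumptions based on the description above.

```python
from pathlib import Path

# importlib.resources.files (Python 3.9+) is the standard way to
# locate data files bundled inside an installed package.
from importlib.resources import files


def resolve_tokenizer(identifier: str):
    """Illustrative sketch of the lookup order: local path, then the
    olmo_data package, then fall back to the Hugging Face Hub."""
    # 1. A path relative to the current directory (or absolute).
    local_path = Path(identifier)
    if local_path.is_file():
        return ("local", local_path)

    # 2. A data file bundled under olmo_data (hypothetical layout:
    #    olmo_data/tokenizers/<identifier>).
    try:
        resource = files("olmo_data").joinpath("tokenizers", identifier)
        if resource.is_file():
            return ("olmo_data", resource)
    except ModuleNotFoundError:
        # olmo_data is not installed; skip this step.
        pass

    # 3. Otherwise, treat the identifier as a HF Hub repo name.
    return ("hf_hub", identifier)
```

Because the local check runs first, existing workflows that pass a relative path keep working, while pip installs of ai2-olmo pick up the bundled files without any network access.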
Fixes #633