allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0
4.37k stars, 431 forks

Move tokenizers to new `olmo_data` package. #645

Closed: 2015aroras closed this 2 months ago

2015aroras commented 2 months ago

Issue: Tokenizer.from_checkpoint assumes that tokenizers are either on the HF Hub or at a path relative to the current directory. This assumption doesn't hold when OLMo is installed from pip as ai2-olmo. More generally, we don't have a clear mechanism for putting data files in our repo.

Fix: This PR creates an olmo_data package whose subdirectories correspond to different types of data (e.g. tokenizers and hf_datasets). Tokenizers are moved under olmo_data, and Tokenizer.from_checkpoint is updated to look at local paths first, then olmo_data, then the HF Hub.
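For illustration, the lookup order described above could be sketched roughly as follows. This is not the code in this PR: `resolve_tokenizer` and its parameters are hypothetical names, the `tokenizers` subdirectory name and the choice to match on the file's base name are assumptions, and the sketch only reports where a tokenizer would be loaded from rather than loading it.

```python
import importlib.util
from importlib import resources
from pathlib import Path
from typing import Tuple


def resolve_tokenizer(
    identifier: str,
    package: str = "olmo_data",   # data package name taken from the PR description
    subdir: str = "tokenizers",   # assumed subdirectory name
) -> Tuple[str, str]:
    """Hypothetical helper: return (source, location) for a tokenizer identifier.

    Lookup order: local path, then package data, then HF Hub.
    """
    # 1. A file that exists on disk (absolute or relative to the CWD) wins.
    local = Path(identifier)
    if local.is_file():
        return "local", str(local)

    # 2. Otherwise look for a bundled file inside the installed data package.
    #    Matching on the base name is an illustrative assumption.
    if importlib.util.find_spec(package) is not None:
        candidate = resources.files(package) / subdir / Path(identifier).name
        if candidate.is_file():
            return "package", str(candidate)

    # 3. Fall back to treating the identifier as an HF Hub repo id.
    #    No network call is made here; the caller would do the download.
    return "hf_hub", identifier
```

Because package data is found through `importlib.resources` rather than a path relative to the current directory, a lookup of this shape keeps working when the code is installed from pip, provided the data files are shipped with the package.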

This change sets up the foundations for adding HF datasets to our repo, so that we don't have to make network calls during training runs.

Fixes #633

OyvindTafjord commented 1 month ago

@2015aroras Could we add something which doesn't break scripts/convert_olmo_to_hf_new.py for old configs? (i.e., this stuff)