(Closed by chawins 1 month ago)
Hello @chawins! For each file on olmo-data.org, there is actually a corresponding *.csv.gz that contains provenance info for the data in the .npy file. So for https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy, it would be https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.csv.gz.
Each CSV has five columns: token start offset (inclusive), token end offset (exclusive), document ID, source file, and line number within that file.
So, for example, the rows:

```
0,1390,https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870762
1390,1854,https://smolenskklad.ru/who-is-aidan-quinn-dating-147.html,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870763
```
indicate that tokens from position 0 (inclusive) to 1390 (exclusive) come from the document with ID https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/ in file olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz. This document is at line 1870762 of that file.
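Putting the pieces above together, here is a minimal sketch of how one might parse such a provenance CSV and find the token ranges belonging to a given source. The file name and the `"books"` substring filter are assumptions for illustration; the five-column layout is taken from the example rows above.

```python
import csv
import gzip

def load_provenance(csv_path):
    """Parse a provenance *.csv.gz with five columns:
    token start (inclusive), token end (exclusive),
    document ID, source file, and line number in that file."""
    entries = []
    with gzip.open(csv_path, "rt", newline="") as f:
        for start, end, doc_id, source_file, line_no in csv.reader(f):
            entries.append({
                "start": int(start),
                "end": int(end),
                "doc_id": doc_id,
                "source_file": source_file,
                "line_no": int(line_no),
            })
    return entries

def token_ranges_for_source(entries, substring):
    """Return (start, end) token ranges whose source path contains
    the given substring, e.g. 'books' (hypothetical filter)."""
    return [(e["start"], e["end"]) for e in entries
            if substring in e["source_file"]]
```

The returned ranges could then be used to slice the corresponding .npy file, e.g. via `numpy.memmap` (the exact dtype of the token array is not stated here, so check it before slicing).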
Hope this helps!
Best, Luca
Oh, this is totally awesome! This should work great for my use case. Thank you so much for the detailed answer (and, of course, for the amazing work here :)). Appreciate it!
❓ The question
Hello! I have a question about a way to get OLMo training data by source (e.g., wiki, books, etc.). I suspect that this may be difficult, but I would like to check whether I am missing anything.
My understanding is that the preprocessed data (e.g., https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy) is already tokenized, mixed, deduped, and shuffled, so there is no easy way to, say, find all training tokens that come from books. Is my understanding correct? I'd appreciate any advice. Thank you!