allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Getting training data by sources #728

Closed by chawins 1 month ago

chawins commented 1 month ago

❓ The question

Hello! I have a question about how to get OLMo training data by source (e.g., wiki, books, etc.). I suspect that this may be difficult, but I would like to check whether I'm missing anything.

Is my understanding correct? I'd appreciate any advice. Thank you!

soldni commented 1 month ago

Hello @chawins! For each file in olmo-data.org, there's actually a corresponding *.csv.gz that contains provenance info for the data in the .npy file. So for https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy, it would be https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.csv.gz.
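A minimal sketch of that mapping, assuming every data file ends in `.npy` and its provenance file sits next to it under the same name. The helper name `provenance_url` is made up for illustration and is not part of the OLMo codebase.

```python
def provenance_url(npy_url: str) -> str:
    """Return the URL of the .csv.gz provenance file for a given .npy URL.

    Assumption: the provenance file differs from the data file only in
    its extension, as in the example from this thread.
    """
    suffix = ".npy"
    if not npy_url.endswith(suffix):
        raise ValueError(f"expected a URL ending in {suffix!r}: {npy_url}")
    # Drop the ".npy" extension and append ".csv.gz" in its place.
    return npy_url[: -len(suffix)] + ".csv.gz"


example = (
    "https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/"
    "gpt-neox-20b-pii-special/part-000-00000.npy"
)
print(provenance_url(example))
```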

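Each CSV has five columns: the start token position (included), the end token position (excluded), the document ID, the path of the source file, and the line number of the document within that file.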

So, for example:

0,1390,https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870762
1390,1854,https://smolenskklad.ru/who-is-aidan-quinn-dating-147.html,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870763

The first row indicates that tokens from position 0 (included) to 1390 (excluded) are from the document with ID https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/ in file olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz. This document is on line 1870762 of that file.
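For the original question (splitting data by source), a rough sketch of reading these rows and counting tokens per source might look like the following. The names `Span`, `parse_rows`, `source_name`, and `tokens_per_source` are hypothetical. Taking the source to be the directory that contains the source file (here `cc_en_head`) is an assumption based only on the two example paths above.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Span(NamedTuple):
    start: int     # first token position (included)
    end: int       # last token position (excluded)
    doc_id: str    # document ID (a URL in the examples above)
    src_path: str  # path of the source file
    line_no: int   # line of the document within the source file


def parse_rows(lines: Iterable[str]) -> Iterator[Span]:
    """Parse provenance CSV lines into Span records."""
    for start, end, doc_id, src_path, line_no in csv.reader(lines):
        yield Span(int(start), int(end), doc_id, src_path, int(line_no))


def source_name(src_path: str) -> str:
    # Assumption: the parent directory of the file names the source,
    # e.g. ".../documents/cc_en_head/cc_en_head-0217.json.gz" -> "cc_en_head".
    return src_path.rstrip("/").split("/")[-2]


def tokens_per_source(spans: Iterable[Span]) -> Counter:
    """Sum the token span lengths for each source."""
    counts: Counter = Counter()
    for span in spans:
        counts[source_name(span.src_path)] += span.end - span.start
    return counts


# The two example rows quoted in this thread.
SAMPLE = (
    "0,1390,https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/,"
    "s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/"
    "cc_en_head-0217.json.gz,1870762\n"
    "1390,1854,https://smolenskklad.ru/who-is-aidan-quinn-dating-147.html,"
    "s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/"
    "cc_en_head-0217.json.gz,1870763\n"
)

spans = list(parse_rows(io.StringIO(SAMPLE)))
print(tokens_per_source(spans))
```

For a downloaded file, the same `parse_rows` function should accept a handle opened with `gzip.open(path, "rt", newline="")` from the standard library, where `path` is wherever you saved the `.csv.gz`.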

Hope this helps!

Best, Luca

chawins commented 1 month ago

Oh, this is totally awesome! This should work great for my use case. Thank you so much for the detailed answer (and of course, for the amazing work here :)). Appreciate it!