(Closed by chawins 1 month ago)
Hello @chawins! For each file on olmo-data.org, there is actually a corresponding *.csv.gz that contains provenance info for the data in the .npy file. So for https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy, it would be https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.csv.gz.
Each CSV has five columns: token start offset (inclusive), token end offset (exclusive), document ID, source file, and line number within that file.
So, for example, the rows:

```
0,1390,https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870762
1390,1854,https://smolenskklad.ru/who-is-aidan-quinn-dating-147.html,s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz,1870763
```
indicate that tokens from position 0 (inclusive) to 1390 (exclusive) come from the document with ID https://smallerpictures1.wordpress.com/2019/09/06/it-chapter-two-2019/ in file olmo-mix/v1_5/documents/cc_en_head/cc_en_head-0217.json.gz. This document is at line 1870762 of that file.
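Putting the pieces above together, here is a minimal sketch of how one might parse such a provenance CSV and find the token ranges belonging to a given source. The file name and the `"books"` substring filter are assumptions for illustration; the five-column layout is taken from the example rows above.

```python
import csv
import gzip

def load_provenance(csv_path):
    """Parse a provenance *.csv.gz with five columns:
    token start (inclusive), token end (exclusive),
    document ID, source file, and line number in that file."""
    entries = []
    with gzip.open(csv_path, "rt", newline="") as f:
        for start, end, doc_id, source_file, line_no in csv.reader(f):
            entries.append({
                "start": int(start),
                "end": int(end),
                "doc_id": doc_id,
                "source_file": source_file,
                "line_no": int(line_no),
            })
    return entries

def token_ranges_for_source(entries, substring):
    """Return (start, end) token ranges whose source path contains
    the given substring, e.g. 'books' (hypothetical filter)."""
    return [(e["start"], e["end"]) for e in entries
            if substring in e["source_file"]]
```

The returned ranges could then be used to slice the corresponding .npy file, e.g. via `numpy.memmap` (the exact dtype of the token array is not stated here, so check it before slicing).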
Hope this helps!
Best, Luca
Oh, this is totally awesome! This should work great for my use case. Thank you so much for the detailed answer (and, of course, for the amazing work here :)). Appreciate it!
❓ The question
Hello! I have a question about a way to get OLMo training data by source (e.g., wiki, books, etc.). I suspect that this may be difficult, but I would like to check whether I am missing anything.
My understanding is that the preprocessed data (e.g., https://olmo-data.org/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy) is already tokenized, mixed, deduped, and shuffled, so there is no easy way to, say, find all training tokens that come from books. Is my understanding correct? I'd appreciate any advice. Thank you!