allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Load HF datasets from `olmo_data` #646

Closed. 2015aroras closed this 1 month ago

2015aroras commented 2 months ago

This PR changes how our downstream evals load HF datasets: they are now loaded from the `olmo_data` package. This removes the need for network calls or special caching mechanisms. Broadly speaking, it was @dirkgr's idea.
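For readers unfamiliar with the pattern, here is a minimal sketch of reading dataset files that ship inside an installed Python package, so no network access or external cache is needed. It is not the code from this PR: the helper `load_packaged_jsonl`, the JSONL layout, and the example path are all hypothetical, and the actual change may store and load datasets differently.

```python
import importlib.resources
import json
from typing import Any, Dict, List


def load_packaged_jsonl(package: str, relative_path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file bundled inside an installed package.

    Hypothetical helper for illustration only. `package` is an importable
    package name and `relative_path` is a "/"-separated path inside it.
    """
    # Locate the package's files on disk (or in a zip) without any network call.
    resource = importlib.resources.files(package)
    # Descend one path segment at a time so this works on any Traversable.
    for part in relative_path.split("/"):
        resource = resource.joinpath(part)
    # Parse one JSON object per non-empty line.
    with resource.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Under these assumptions, a call such as `load_packaged_jsonl("olmo_data", "hf_datasets/piqa/validation.jsonl")` would return the rows as a list of dicts. The path shown is illustrative, not the real layout of `olmo_data`.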

The PR looks big and ugly, but it's mostly busywork (such as adding all the dataset files). The commits of interest, if any, are in my opinion https://github.com/allenai/OLMo/pull/646/commits/84dbf43d33188d8df404647e04358d1fdfd1439d and https://github.com/allenai/OLMo/pull/646/commits/d112d4c42b7df8726cc51eb699cb9474baed61bf

Tested on LUMI on 1 node with the same checkpoint: loading took 2 minutes, compared to 20 minutes without this change, and the downstream evals came out the same.