Closed: 2015aroras closed this 1 month ago
This PR changes the loading of HF datasets in our downstream eval so that they are loaded from the olmo_data package. This removes the need for network calls or special caching mechanisms and, broadly speaking, was @dirkgr's idea.
The PR looks big and ugly, but it's mostly busy work (like adding all the dataset files). The commits of interest (if any), in my opinion, are https://github.com/allenai/OLMo/pull/646/commits/84dbf43d33188d8df404647e04358d1fdfd1439d and https://github.com/allenai/OLMo/pull/646/commits/d112d4c42b7df8726cc51eb699cb9474baed61bf
Tested on LUMI on 1 node with the same checkpoint: loading took 2 mins, compared to 20 mins without this change, and the downstream eval results were the same.
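For readers unfamiliar with the approach, here is a minimal sketch of reading a dataset file that ships inside an installed Python package, so that no network call or external cache is involved. It is not the code from this PR: the helper name `load_packaged_jsonl`, the JSON Lines format, and the example path are assumptions made for illustration, and the real layout of `olmo_data` may differ.

```python
import json
from importlib import resources


def load_packaged_jsonl(package: str, relative_path: str) -> list:
    """Read a JSON Lines file bundled inside an installed package.

    Hypothetical helper for illustration only; not part of OLMo.
    Requires Python 3.9+ for importlib.resources.files().
    """
    # files() returns a Traversable rooted at the package's directory.
    resource = resources.files(package)
    # Walk one path segment at a time so this also works for
    # Traversable implementations that accept a single segment only.
    for part in relative_path.split("/"):
        resource = resource / part
    with resource.open("r", encoding="utf-8") as f:
        # Skip blank lines; parse each remaining line as one JSON record.
        return [json.loads(line) for line in f if line.strip()]
```

A call might look like `load_packaged_jsonl("olmo_data", "some_task/validation.jsonl")`, where the relative path is a made-up placeholder. Because the data is resolved through the package itself, it is found wherever the package is installed, which is presumably why loading no longer depends on network access.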