allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

HF dataset loading optimizations #623

Closed · 2015aroras closed this 3 weeks ago

2015aroras commented 3 weeks ago

Issue: Loading HF datasets for downstream evals has been slowing down the start of our runs because:

  1. Every process tries to load each HF dataset at the same time. This produces a lot of simultaneous network traffic to one endpoint (potentially triggering throttling) and possibly some contention over HF's on-disk dataset cache.
  2. Even when a dataset has already been cached locally, loading it still results in network calls because HF checks whether the remote data has changed (see the sketch below).
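
For illustration, this is roughly the startup pattern that causes the problem; the dataset name and split here are placeholders, not the exact downstream evals we load:

```python
# What every rank was effectively doing at startup. Even with a warm
# local cache, load_dataset() still contacts the Hub to check whether
# the remote dataset has changed.
from datasets import load_dataset

# "piqa" / "validation" are illustrative, not the exact eval set.
ds = load_dataset("piqa", split="validation")
```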

Fix: This PR tackles these issues by:

  1. Using HF's save_to_disk and load_from_disk dataset methods to keep a local copy of each dataset. These on-disk copies are no longer associated with the online versions, so loading them does not result in network traffic (as far as I can tell).
  2. Making only the FS rank 0 process perform network calls for HF dataset loading. This requires coordinating the processes with barrier(), but that seems less problematic. Note that a barrier is invoked each time a dataset needs to be loaded (on the positive side, if all datasets are already in the cache, there are no barrier invocations). A sketch of both steps follows this list.
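
A minimal sketch of the approach, assuming torch.distributed is already initialized and using the global rank 0 rather than the FS-local rank 0 described above; the function name, signature, and cache layout are illustrative, not the PR's actual code:

```python
import os

import torch.distributed as dist
from datasets import load_dataset, load_from_disk


def load_eval_dataset(name: str, split: str, cache_dir: str):
    """Load an HF dataset such that only one rank performs network calls."""
    local_path = os.path.join(cache_dir, name, split)

    # Only rank 0 downloads the dataset and writes the on-disk copy.
    if dist.get_rank() == 0 and not os.path.isdir(local_path):
        ds = load_dataset(name, split=split)
        ds.save_to_disk(local_path)

    # One barrier per dataset: all ranks wait until the copy exists.
    dist.barrier()

    # load_from_disk reads the saved copy and (as far as I can tell)
    # makes no network calls, unlike load_dataset on a warm cache.
    return load_from_disk(local_path)
```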

In a 2-GPU interactive session, startup goes from ~10 minutes without this new cache to ~1 minute with the cache populated beforehand. I have already populated a copy of the cache at /net/weka/reviz/hf_datasets_cache.

I haven't yet verified that evaluation correctness is unaffected by this change (no GPUs were available), but I don't expect any negative impact.

Edit: The barrier is now invoked exactly once per dataset.