Closed joellliu closed 4 weeks ago
In the official configs we use training data from object storage (R2), but in some of our training runs (including the Twin-2T run) we use local storage instead. We do prefer using local storage when we can, but that isn't an option when running on a GPU provider that uses different file system instances for every node.
Someone on our team may have measured the specific performance cost of downloading, but my guess is we haven't, for the reason above (if downloading is inevitable, why measure it; and if it isn't needed, why not download beforehand?). To my (limited) knowledge, our code only grabs the chunks of data it needs at a time (maybe with a little read-ahead), so the download time is dwarfed by the model compute time.
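A minimal sketch of that access pattern, for illustration only (the helper name `read_chunk` and the record layout are hypothetical, and a `BytesIO` stands in for a remote R2 object — in practice the loader would issue HTTP range requests against R2's S3-compatible API):

```python
import io

def read_chunk(obj: io.BytesIO, offset: int, length: int) -> bytes:
    """Fetch only the byte range [offset, offset + length).
    The whole shard is never downloaded — this mimics an HTTP
    range request against object storage."""
    obj.seek(offset)
    return obj.read(length)

# A fake 1 MiB "training shard" where each example is a fixed-size record.
RECORD_SIZE = 4096
shard = io.BytesIO(bytes(1024 * 1024))

# The loader fetches just the records the current batch needs, so the
# small transfers overlap with (and are dwarfed by) model compute.
batch_indices = [0, 7, 42]
batch = [read_chunk(shard, i * RECORD_SIZE, RECORD_SIZE) for i in batch_indices]
print(len(batch), len(batch[0]))  # 3 4096
```

The key point the sketch shows is that per-step transfer size is bounded by the batch, not by the dataset, which is why streaming from R2 stays cheap relative to compute.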
Ok thank you!
❓ The question
Hi OLMo team, from the configs it seems you store the training data in object storage (Cloudflare R2) and load it directly from object storage during training, without pre-downloading/caching it to local storage. Is that correct? And do you observe any latency/slower training because of this? Thank you!