allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

Tokenized dataset? #10

Open joelburget opened 1 week ago

joelburget commented 1 week ago

I was wondering whether it would be possible to upload the tokenized dataset. I tried following the instructions under the Pretraining header but had trouble installing MegaBlocks due to a CUDA version mismatch. In any case, I think uploading the tokenized dataset to Hugging Face would be very helpful, since it would save others the work.

Muennighoff commented 1 week ago

Agree that this would be great. @soldni, what do you think? Here are all the S3 paths of the tokenized dataset; can we easily upload them to HF?