allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

FastText training data #209

Open msaebi1993 opened 1 month ago

msaebi1993 commented 1 month ago

Hi, I was wondering if you could share more details about the training data mix for your fastText model. In your blog post, you mentioned that you used the following sources: https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d

Specifically, I have the following questions:

  1. Can you please elaborate on the percentage of each data source in the final training data mix? Any chance you could share the training data as well?
  2. I see in a recent commit that you're using a new fastText model. Does this have the same training data mix as the one described in the blog? Can you please elaborate on the difference between the two?