
allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

https://allenai.github.io/dolma/

Apache License 2.0

1.02k stars, 108 forks

FastText training data #209

Open msaebi1993 opened 2 months ago

msaebi1993 commented 2 months ago

Hi, I was wondering if you could share more details about the training data mix for your fastText model. In your blog post (https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d), you mentioned that you used the following sources:

Wikipedia
Web pages cited in Wikipedia (through MegaWika)
Small Web RSS feeds (through Kagi)
OpenHermes 2.5
Semantic Scholar
Project Gutenberg
OpenWebMath

Specifically, I have the following questions:

Can you please elaborate on the percentage of each data source in the final training data mix? Any chance you could share the training data as well?
I see in a recent commit that you're using a new fastText model. Does this have the same training data mix as the one described in the blog? Can you please elaborate on the difference between the two?