allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Is there explicitly instruction-following data in the version of Dolma used to train v1? #658

Open · john-hewitt opened this issue 1 month ago

john-hewitt commented 1 month ago

Hi everyone,

I'm working on a research project related to instruction following, and it would be amazing to have a language model with a guarantee that no explicit instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.

Some thoughts:

I realize data can leak in, so the answer is probably not "definitely not" — but does anyone know if the answer is at least "not intentionally"?

See the corresponding Dolma request (I wasn't sure how much information sharing there would be between the two repos): https://github.com/allenai/dolma/issues/177
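For the "not by leakage" half of the question, one rough check is to scan pretraining text for the template strings that instruction-tuning datasets use. A minimal sketch below, not part of OLMo or Dolma tooling: the marker strings are the well-known Alpaca-style prompt headers, and the function name is hypothetical.

```python
# Hypothetical spot-check for instruction-tuning templates in pretraining text.
# The markers are prompt headers used by Alpaca-style datasets; a hit suggests
# (but does not prove) leaked instruction-following data.

INSTRUCTION_MARKERS = [
    "Below is an instruction that describes a task",
    "### Instruction:",
    "### Response:",
]


def find_instruction_markers(text: str) -> list[str]:
    """Return the known template markers that appear in `text`."""
    return [m for m in INSTRUCTION_MARKERS if m in text]


if __name__ == "__main__":
    sample = "### Instruction:\nSummarize this article.\n### Response:\nOK."
    print(find_instruction_markers(sample))
```

In practice one would stream this over the corpus shards (and add markers for other dataset formats); an empty result everywhere is still only evidence of "not intentionally", not a guarantee.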

Thanks!

soldni commented 1 month ago

(responded on ticket in Dolma repository)