allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

A Question about the meaning of dolma_v1.6_cc_en #134

Closed aleien95 closed 5 months ago

aleien95 commented 5 months ago

Hello, I noticed that dolma_v1.6_cc_en is split into cc_en_head, cc_en_middle, and cc_en_tail. What do these names mean?

soldni commented 5 months ago

Hi @aleien95,

The names refer to the buckets into which the CCNet pipeline sorts documents extracted from Common Crawl. The CCNet pipeline estimates how similar each document is to Wikipedia pages using a KenLM statistical language model. Documents that are most similar are placed in cc_en_head, followed by cc_en_middle and then cc_en_tail.

We retain the same layout out of convenience.
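To make the bucketing concrete, here is a minimal sketch of how documents could be split into the three buckets. It assumes the similarity score is a perplexity under the Wikipedia-trained KenLM model (lower means closer to Wikipedia) and that the cutoffs are terciles of a sample of scores. The function names and the tercile cutoffs are hypothetical, chosen for illustration; this is not the actual CCNet or Dolma code.

```python
from typing import List, Tuple


def tercile_cutoffs(perplexities: List[float]) -> Tuple[float, float]:
    """Return (head_cutoff, tail_cutoff) splitting the scores into rough thirds.

    Assumption: cutoffs are estimated from a sample of perplexity scores.
    """
    ordered = sorted(perplexities)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def assign_bucket(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Map a perplexity score to a bucket name (lower perplexity = more Wikipedia-like)."""
    if perplexity < head_cutoff:
        return "cc_en_head"
    if perplexity < tail_cutoff:
        return "cc_en_middle"
    return "cc_en_tail"


# Hypothetical perplexity scores for six documents.
sample = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
head_cutoff, tail_cutoff = tercile_cutoffs(sample)
buckets = [assign_bucket(p, head_cutoff, tail_cutoff) for p in sample]
```

With this sample, the two lowest-perplexity documents land in cc_en_head, the next two in cc_en_middle, and the last two in cc_en_tail.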

Hope this helps! Feel free to reopen this issue if you have more questions.

Best, Luca