allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Clarification Needed on "C4 NoPunc" in Data Processing #162

Closed codefly13 closed 3 months ago

codefly13 commented 3 months ago

I am currently working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to understand exactly what this term refers to. Specifically, does "C4 NoPunc" mean:

  1. Quality filters are applied except for the "lines_with_no_ending_punctuation" rule. This means all other C4 quality filters are applied, but lines are not removed based solely on the absence of ending punctuation.

  2. Only the "lines_with_no_ending_punctuation" rule is used in quality filtering. This means that the sole criterion for removing lines is the absence of ending punctuation, and no other C4 quality filters are applied.

Could you please provide some insight into which of these interpretations is correct, or if there's another meaning entirely?

soldni commented 3 months ago

Hi @codefly13!

It's the latter: only the lines_with_no_ending_punctuation rule is used in quality filtering.
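For illustration, here is a minimal sketch of what a rule like that amounts to. It is not Dolma's actual tagger code: the function names are hypothetical, and the set of terminal marks (period, exclamation mark, question mark, closing double quote) is an assumption taken from how the C4 rule is usually described, so the real implementation may differ.

```python
# Hypothetical sketch of a "no ending punctuation" line filter.
# Names and the exact punctuation set are assumptions, not Dolma's API.

# Assumed set of terminal marks: period, exclamation mark,
# question mark, closing double quote.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def ends_with_terminal_punctuation(line: str) -> bool:
    """Return True if the line, ignoring trailing whitespace, ends in a terminal mark."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def drop_lines_with_no_ending_punctuation(text: str) -> str:
    """Keep only the lines that end in terminal punctuation; apply no other filter."""
    kept = [line for line in text.split("\n") if ends_with_terminal_punctuation(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    document = (
        "Home | About | Contact\n"
        "This sentence ends with a period.\n"
        "Read more\n"
        "Is this line kept?"
    )
    # The navigation-style lines without ending punctuation are dropped.
    print(drop_lines_with_no_ending_punctuation(document))
```

Under the first interpretation in the question, this would be the one rule skipped; under the second (the correct one), it is the only rule applied.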

I'm closing this issue assuming that the above answers your question, but please re-open it in case you need further clarification!

Best, Luca