allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

How is exact paragraph deduplication performed? #111

Closed silverriver closed 6 months ago

silverriver commented 7 months ago

Hi, thank you for your great work.

I am wondering how exactly the "Exact paragraph deduplication" operation is carried out.

As I understand it, "Exact paragraph deduplication" follows these steps:

  1. Split each document into paragraphs.
  2. Detect duplicated paragraphs (using the Bloom filter).
  3. Remove the duplicated paragraphs.
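
The three steps above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Dolma's actual implementation: the `BloomFilter` class, its bit-array size and hash count, the `dedupe_paragraphs` helper, and the choice to split paragraphs on newlines are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical; not Dolma's implementation)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def dedupe_paragraphs(documents, bloom):
    """Drop every paragraph that the filter has (probably) seen before."""
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):  # step 1: split into paragraphs (assumed: newline)
            # Step 2: query and update the filter; blank lines are never deduplicated.
            if para.strip() and bloom.add(para):
                continue  # step 3: remove the duplicate
            kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned


docs = ["Shared footer\nUnique text A", "Unique text B\nShared footer"]
result = dedupe_paragraphs(docs, BloomFilter())
```

In this sketch the first document keeps "Shared footer" and the second loses it. A Bloom filter can report false positives but never false negatives, so a filter-based approach may occasionally drop a paragraph that was not actually a duplicate.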

However, there are a few questions:

  1. For step 3, assume a paragraph is duplicated N times. It is then reasonable to remove N-1 of the copies. Which one of the N paragraphs is retained, and which N-1 are removed?
  2. Removing a paragraph from a document will usually hurt the original document. Will this degrade the data quality?

soldni commented 7 months ago

Hello!

Regarding your questions:

  1. We keep the first one seen by the deduper. The process is non-deterministic, so which copy that is can differ between runs.
  2. Paragraph deduplication is fairly standard practice when training language models, so it shouldn't be an issue.
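
To illustrate the first answer: under a keep-first policy, which document retains a shared paragraph depends on the order in which documents reach the deduper. The sketch below uses a plain Python `set` in place of the Bloom filter, and `keep_first` is a hypothetical helper written for this example, not part of Dolma.

```python
def keep_first(documents):
    """Keep only the first occurrence of each paragraph across documents.

    `documents` is a list of (doc_id, text) pairs. Hypothetical helper that
    splits paragraphs on newlines, which is an assumption for this example.
    """
    seen = set()
    output = {}
    for doc_id, text in documents:
        kept = []
        for para in text.split("\n"):
            if para in seen:
                continue  # a copy was already kept elsewhere; drop this one
            seen.add(para)
            kept.append(para)
        output[doc_id] = kept
    return output


doc_a = ("a", "Shared footer\nOnly in A")
doc_b = ("b", "Shared footer\nOnly in B")

# Same inputs, two processing orders, two different results.
order_1 = keep_first([doc_a, doc_b])  # document "a" keeps the shared paragraph
order_2 = keep_first([doc_b, doc_a])  # document "b" keeps the shared paragraph
```

If the order in which documents are processed varies from run to run, the surviving copy varies with it, even though exactly one copy is kept each time.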