allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Duplicate ids in Dolma v1.7 #157

Open Vedaad-Shakib opened 4 months ago

Hi,

While downloading and processing Dolma v1.7, I noticed that many samples in the dataset share the same id field. For example, the Project Gutenberg source contains 175 duplicates that can be found just by looking at the id column. An example of a duplicated id is 8fddd3535f86e159339e1ff9be64fdda in the RefinedWeb split. This was surprising given the significant deduplication done for Dolma v1.7. Is this a bug in the dataset?
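
For anyone who wants to reproduce the check, here is a minimal sketch of counting repeated ids. It assumes the data is stored as gzip-compressed JSON Lines files in which every record has a top-level `id` field. The function names `iter_lines` and `count_duplicate_ids` are hypothetical helpers written for this illustration, not part of the dolma toolkit.

```python
import gzip
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield raw lines from a sequence of gzip-compressed JSONL files.

    Assumption: each file is gzip-compressed, UTF-8 encoded JSON Lines.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            yield from f


def count_duplicate_ids(lines: Iterable[str]) -> Dict[str, int]:
    """Return {id: occurrences} for every id that appears more than once.

    Assumption: each non-blank line is a JSON object with an "id" key.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["id"]] += 1
    # Keep only ids seen at least twice.
    return {doc_id: n for doc_id, n in counts.items() if n > 1}
```

Passing all files of one source through a single call, e.g. `count_duplicate_ids(iter_lines(paths))`, also catches ids that repeat across files, which a per-file check would miss. The counter holds every distinct id in memory, so for the largest sources a hash-based or on-disk approach may be needed instead.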