allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0
894 stars, 90 forks

Need clarification of Gopher in Step 2 #172

Open mihara-bot opened 2 months ago

mihara-bot commented 2 months ago

Dear authors, I was trying to reimplement Dolma-Web as described in your paper. However, in Step 2, while using the dolma toolkit, I found that the Gopher implementation in this repo differs from the original Gopher rules at http://arxiv.org/abs/2112.11446. Specifically, the current code at /python/dolma/taggers.py does not compute the 'Duplicate paragraph fraction' or the 'Duplicate paragraph character fraction', both of which are listed in Table A1 of the Gopher paper.
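To make the two metrics concrete, here is a minimal sketch of how they might be computed. It is not taken from the dolma code or from the Gopher implementation. The function name `duplicate_paragraph_stats`, the blank-line paragraph separator, and the convention of counting every occurrence after the first as a duplicate are all assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple


def duplicate_paragraph_stats(text: str, sep: str = "\n\n") -> Tuple[float, float]:
    """Return (duplicate paragraph fraction, duplicate paragraph character fraction).

    Hypothetical helper: paragraphs are split on `sep` (assumed to be a blank
    line), and every repeat of a paragraph after its first occurrence is
    counted as a duplicate.
    """
    # Split into paragraphs, dropping empty or whitespace-only ones.
    paragraphs = [p.strip() for p in text.split(sep) if p.strip()]
    if not paragraphs:
        return 0.0, 0.0

    counts = Counter(paragraphs)

    # Number of paragraphs that repeat an earlier paragraph.
    dup_paragraphs = sum(c - 1 for c in counts.values())
    # Characters contained in those repeated paragraphs.
    dup_chars = sum(len(p) * (c - 1) for p, c in counts.items())
    total_chars = sum(len(p) for p in paragraphs)

    return dup_paragraphs / len(paragraphs), dup_chars / total_chars


if __name__ == "__main__":
    doc = "aaaa\n\nbb\n\naaaa\n\ncc"
    para_frac, char_frac = duplicate_paragraph_stats(doc)
    print(para_frac, char_frac)
```

A document would then be flagged when either fraction exceeds the corresponding threshold in Table A1 of the Gopher paper.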

Is this a bug, or is there no need to compute these metrics? Looking forward to your kind reply.

Best regards, Xinlin Zhuang