Closed TTTTao725 closed 5 months ago
I tested the document deduplication command on a sample dataset and got the following error.
To replicate the problem I encountered, I've generated a dummy dataset consisting of six documents:
Then I tried to run:
where I obtain the error:
(P.S. I have also tried reducing the number of processes to 1, which didn't work out either.)
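The dataset, command, and traceback referred to above were attached to the original report. As a purely hypothetical illustration of what such a dummy dataset might look like in dolma's document format (gzipped JSONL with `id`, `text`, and `source` fields, plus `metadata` for document-level dedupe), abbreviated to three documents here; all paths, URLs, and texts are made up:

```bash
mkdir -p data/documents
cat > data/documents/dummy.jsonl <<'EOF'
{"id": "doc-0", "text": "A paragraph.\nA repeated paragraph.", "source": "dummy", "metadata": {"url": "https://example.com/a"}}
{"id": "doc-1", "text": "A repeated paragraph.", "source": "dummy", "metadata": {"url": "https://example.com/a"}}
{"id": "doc-2", "text": "A unique paragraph.", "source": "dummy", "metadata": {"url": "https://example.com/b"}}
EOF
# dolma tooling typically reads gzipped JSONL under a documents/ directory
gzip --force data/documents/dummy.jsonl
```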
Hi!
Thank you for the bug report. To clarify, is this using the version currently available on PyPI, or is it running on the main branch?
I think the problem is that the document-level deduplication option is meant to be used with document metadata, so you can filter out documents with the same URL, for example. So if you set `--dedupe.documents.attribute_name 'duplicate_documents'`, you must also set this option:
```
--dedupe.documents.key [DEDUPE.DOCUMENTS.KEY]
    Name of the input field to use for deduplication, e.g. `$.metadata.url`
```
The actual content-based deduplication runs on paragraphs, but you can filter entire documents if too many paragraphs are duplicates. This is not super clear in the documentation.
If you set `--dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans'` instead of `--dedupe.documents.attribute_name`, it should work.
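For concreteness, here is a minimal sketch of both invocations, assuming the `dolma dedupe` CLI with the bloom-filter options from the project README; flag names other than the two discussed above may differ between versions, and all paths and counts are placeholders:

```bash
# Document-level dedupe: flags whole documents whose metadata key
# (here $.metadata.url) has already been seen.
dolma dedupe \
    --documents "data/documents/*.jsonl.gz" \
    --dedupe.documents.attribute_name 'duplicate_documents' \
    --dedupe.documents.key '$.metadata.url' \
    --bloom_filter.file /tmp/doc_dedupe_bloom.bin \
    --bloom_filter.estimated_doc_count 6 \
    --bloom_filter.desired_false_positive_rate 0.0001 \
    --processes 1

# Paragraph-level (content-based) dedupe: flags duplicate paragraph
# spans within documents instead. Note the separate bloom filter file,
# since the filter persists state between runs.
dolma dedupe \
    --documents "data/documents/*.jsonl.gz" \
    --dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans' \
    --bloom_filter.file /tmp/para_dedupe_bloom.bin \
    --bloom_filter.estimated_doc_count 6 \
    --bloom_filter.desired_false_positive_rate 0.0001 \
    --processes 1
```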
Thanks Peter, I have used `--dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans'` to filter documents before, but it was quite aggressive: 5000 documents -> 44 documents after filtering. So I wondered whether I had used the wrong command and tested `--dedupe.documents.attribute_name 'duplicate_documents'` instead.
Good to know the right way of using `dedupe.documents` :)
Happy new year! /Tao
> Hi!
> Thank you for the bug report. To clarify, is this using the version currently available on PyPI, or is it running on the main branch?
The one I'm using:
```
pip install git+https://github.com/allenai/dolma.git@28727302a34225f4ce70ce97940cb18ad9a91583
```
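(If it helps to double-check which commit is actually installed, standard pip commands should report the recorded source, e.g.:

```bash
pip show dolma              # installed version and metadata
pip freeze | grep -i dolma  # for VCS installs, usually shows the git URL and commit
```
)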
Closing since this should now be fixed in main/the latest release. LMK if that's not the case!