A quick de-noising trick for dataset extraction (be it documents or paragraphs) is to hash each extracted item and maintain a set of the hashes seen so far, including an item in the extracted data only if its hash is new.
This lets us include recurring artifacts only once, such as documents that consist of nothing but a "this document has been retracted" message, and also removes trivial duplicates from the paragraph data (e.g. "the proof is left as an exercise to the reader").
The c14n submodule already has an MD5-based solution that supports this kind of deduplication, but it would be best to upgrade to SHA-256 and worry even less about possible collisions.
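
As a rough illustration of the idea, here is a minimal Python sketch of hash-based deduplication with SHA-256. It is not the c14n submodule's code: the `normalize` helper is a hypothetical stand-in for real canonicalization (it only collapses whitespace), and the names `unique_texts`, `sample`, and `kept` are made up for this example.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Hypothetical stand-in for canonicalization: collapse whitespace runs."""
    return " ".join(text.split())


def unique_texts(texts: Iterable[str]) -> Iterator[str]:
    """Yield only the texts whose SHA-256 digest has not been seen before."""
    seen = set()  # raw 32-byte digests of everything yielded so far
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate content, skip it
        seen.add(digest)
        yield text


# Made-up sample data: the second and fourth entries duplicate the first
# once whitespace is normalized.
sample = [
    "The proof is left as an exercise to the reader.",
    "The proof is left  as an exercise\nto the reader.",
    "A genuinely new paragraph.",
    "The proof is left as an exercise to the reader.",
]
kept = list(unique_texts(sample))
```

Here `kept` retains the first occurrence of each distinct paragraph, in input order. Storing the raw 32-byte digests rather than the texts keeps the memory cost of the `seen` set fixed per item, regardless of document length.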