I am wondering how exactly the "Exact paragraph deduplication" operation is carried out.
As I understand it, "Exact paragraph deduplication" follows these steps:
1. Split each document into paragraphs.
2. Detect duplicated paragraphs (using the Bloom filter).
3. Remove the duplicated paragraphs.
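
To make my understanding concrete, here is a minimal Python sketch of these three steps. It is only my guess, not the project's actual code: `BloomFilter` and `dedup_paragraphs` are hypothetical names, paragraphs are assumed to be separated by newlines, and the sketch assumes the first-seen copy is kept and every later copy is dropped, which is exactly the point I am unsure about.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter for illustration only (hypothetical, not the project's implementation)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen before"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedup_paragraphs(documents):
    """Drop every paragraph that already appeared earlier in the corpus."""
    seen = BloomFilter()
    cleaned = []
    for doc in documents:
        kept = []
        # Step 1: split the document into paragraphs (assumed newline-separated).
        for para in doc.split("\n"):
            key = para.strip()
            if not key:
                continue  # ignore empty paragraphs
            # Step 2: detect a duplicate via the Bloom filter.
            if key in seen:
                continue  # Step 3: remove the later copy (assumed policy)
            seen.add(key)
            kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned
```

With this policy, `dedup_paragraphs(["A\nB\nC", "B\nD", "A"])` keeps the first document intact, reduces the second to `D`, and leaves the third empty. So which copy survives depends entirely on processing order, and a Bloom filter can also report false positives, so a unique paragraph may occasionally be removed.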
However, there are a few questions:
1. For step 3, suppose a paragraph is duplicated N times. It then seems reasonable to remove N-1 of the copies. Which one of the N paragraphs is retained, and which N-1 are removed?
2. Removing a paragraph from a document will usually damage the original document. Will this degrade the data quality?
Thank you for your great work.