Closed xingenju closed 4 years ago
The decision may seem weird, but when looking at the data it appeared that most duplicate paragraphs were either:
Therefore I chose not to include them. But I agree this decision isn't really documented in the paper, and it could be worth some experiments.
Closing, feel free to reopen if I haven't fully answered your questions.
E.g., if "it is an issue about cc_net" is a paragraph and it appears three times, then because the NativeHashSet stores a value of 1 for this key, all 3 paragraphs will be dropped. Why not keep one copy?
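
To make the two behaviours concrete, here is a minimal sketch in plain Python. It is not cc_net's actual code: the function names, the SHA-1 hashing, and the lowercase/strip normalization are all assumptions made for illustration. It only contrasts "drop every copy of a duplicated paragraph" with "keep the first copy".

```python
import hashlib
from collections import Counter


def _key(paragraph: str) -> bytes:
    # Hypothetical normalization (strip + lowercase) followed by a hash;
    # the real pipeline may normalize and hash differently.
    return hashlib.sha1(paragraph.strip().lower().encode("utf-8")).digest()


def drop_all_duplicates(paragraphs):
    # Behaviour described in the question: any paragraph whose hash
    # occurs more than once is removed entirely, including its first copy.
    counts = Counter(_key(p) for p in paragraphs)
    return [p for p in paragraphs if counts[_key(p)] == 1]


def keep_first_copy(paragraphs):
    # Alternative asked about: keep the first occurrence, drop later ones.
    seen = set()
    kept = []
    for p in paragraphs:
        k = _key(p)
        if k not in seen:
            seen.add(k)
            kept.append(p)
    return kept


paragraphs = [
    "it is an issue about cc_net",
    "some unique text",
    "it is an issue about cc_net",
    "it is an issue about cc_net",
]
print(drop_all_duplicates(paragraphs))
print(keep_first_copy(paragraphs))
```

With the first policy the repeated paragraph disappears completely; with the second, a single copy survives alongside the unique text.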