[Exact Substring Deduplication] Analysis

bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.

Apache License 2.0


Open ChenghaoMou opened 2 years ago

ChenghaoMou commented 2 years ago

Near deduplication #7 only operates at the file level. It is also possible for a file to be

- a substring of another file, while the minhash/simhash fingerprints are wildly different
- composed of multiple snippets from different sources

Do we do something about such files, knowing they contain large chunks of repeated snippets?
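
To illustrate the substring case, here is a minimal sketch that uses exact Jaccard similarity over token shingles as a stand-in for what a minhash fingerprint estimates. The helper names (`shingles`, `jaccard`, `containment`), the shingle size, and the toy files are assumptions made for illustration; none of this is taken from the pipeline in #7.

```python
def shingles(text, k=3):
    """Return the set of k-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity, the quantity that minhash approximates."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Fraction of the shingles of `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 1.0


# Hypothetical toy data: a small file copied verbatim into a much larger one.
small = "def add(a, b):\n    return a + b\n"
large = small + "\n".join(f"def f{i}(x):\n    return x * {i}" for i in range(50))

s_small, s_large = shingles(small), shingles(large)
print(f"jaccard     = {jaccard(s_small, s_large):.3f}")
print(f"containment = {containment(s_small, s_large):.3f}")
```

The small file is fully contained in the large one, yet the Jaccard similarity is tiny because the union is dominated by the large file's shingles, so a file-level near-duplicate threshold would likely not pair them.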

lvwerra commented 2 years ago

How hard would it be to analyze how often this is the case, maybe on a subset of the data?
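
As a rough starting point for such an analysis, the sketch below counts files in a small sample whose content appears verbatim inside another, longer file. It is a naive quadratic check that is only feasible on a small subset; the function name `find_contained_files`, the `min_chars` cutoff, and the sample files are hypothetical. A full-scale run would probably need something like a suffix-array-based approach instead.

```python
def find_contained_files(files, min_chars=20):
    """Return (inner, outer) pairs where inner's content is a substring of outer.

    `files` maps a file name to its content. Files shorter than `min_chars`
    after stripping whitespace are skipped to avoid trivial matches, and
    exact duplicates are ignored since file-level deduplication covers them.
    """
    items = sorted(files.items(), key=lambda kv: len(kv[1]))
    pairs = []
    for i, (name, text) in enumerate(items):
        body = text.strip()
        if len(body) < min_chars:
            continue
        # Only files of equal or greater length can contain this one.
        for other_name, other_text in items[i + 1:]:
            if other_text.strip() != body and body in other_text:
                pairs.append((name, other_name))
                break  # one containing file is enough to count this file
    return pairs


# Hypothetical sample standing in for a subset of the dataset.
subset = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "import os\n\ndef add(a, b):\n    return a + b\n\nprint(add(1, 2))\n",
    "c.py": "print('hello world, this is unrelated')\n",
}

pairs = find_contained_files(subset)
print(pairs)
print(f"{len(pairs)} of {len(subset)} files are contained in another file")
```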