bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

[Exact Substring Deduplication] Analysis #8

Open ChenghaoMou opened 2 years ago

ChenghaoMou commented 2 years ago

Near-deduplication (#7) operates only at the file level. It is also possible for a file to be

  1. a substring of another file, even though the two files' MinHash/SimHash fingerprints are wildly different
  2. composed of multiple snippets from different sources

Should we do something about such files, knowing they contain large chunks of repeated snippets?
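To illustrate the first case, here is a minimal sketch, not the pipeline from #7. MinHash estimates Jaccard similarity between shingle sets, so the sketch computes exact Jaccard similarity directly and compares it with containment (the share of the small file's shingles found in the large one). The helper names, the 5-token shingle size, and the toy files are all assumptions made for illustration.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of k-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity, the quantity that MinHash approximates."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a: set[str], b: set[str]) -> float:
    """Fraction of a's shingles that also appear in b."""
    return len(a & b) / len(a) if a else 1.0


# Hypothetical toy data: a short snippet that is an exact prefix of a larger file.
snippet = "def add(a, b):\n    return a + b\n"
large_file = snippet + "\n".join(
    f"def f{i}(x):\n    return x * {i}\n" for i in range(50)
)

small, large = shingles(snippet), shingles(large_file)
jaccard_score = jaccard(small, large)
containment_score = containment(small, large)
print(f"jaccard={jaccard_score:.3f} containment={containment_score:.3f}")
```

The snippet is fully contained in the larger file, yet its Jaccard similarity is close to zero, so a file-level near-duplicate threshold would not flag the pair.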

lvwerra commented 2 years ago

How hard would it be to analyze how often this happens, perhaps on a subset of the data?
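A rough way to get a first number on a small sample is sketched below. It assumes the sample fits in memory and swaps in fixed-size windows of consecutive non-empty lines instead of a true exact-substring method (a suffix-array approach would likely be the more scalable choice). The function names, the window size `n`, and the toy files are hypothetical.

```python
from collections import defaultdict
from typing import Iterator


def line_windows(text: str, n: int) -> Iterator[tuple[str, ...]]:
    """Yield every run of n consecutive non-empty, stripped lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(len(lines) - n + 1):
        yield tuple(lines[i:i + n])


def fraction_with_shared_window(files: dict[str, str], n: int = 10) -> float:
    """Fraction of files sharing at least one n-line window with another file."""
    owners: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for name, text in files.items():
        for window in line_windows(text, n):
            owners[window].add(name)

    flagged: set[str] = set()
    for names in owners.values():
        if len(names) > 1:  # the same window occurs in more than one file
            flagged |= names
    return len(flagged) / len(files) if files else 0.0


# Hypothetical sample: b.py embeds all of a.py, c.py is unrelated.
body = "\n".join(f"x{i} = {i}" for i in range(5))
sample = {
    "a.py": body,
    "b.py": "import os\n" + body + "\nprint(x0)",
    "c.py": "\n".join(f"y{i} = {i}" for i in range(5)),
}
shared_fraction = fraction_with_shared_window(sample, n=3)
print(f"{shared_fraction:.2f} of sampled files share a 3-line window")
```

A small `n` inflates the count with boilerplate such as license headers, so the window size would need tuning before the resulting fraction means much.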