bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

[Near Deduplication] Post processing #9

Open ChenghaoMou opened 2 years ago

ChenghaoMou commented 2 years ago

The current script builds clusters of duplicates, but in some cases it can yield unwanted results:

When doc B is clustered under doc A's name, another doc C can also be clustered under B's name (A~B, B~C, C!~A). As a result, when we delete the non-"extreme" docs from each cluster, we can end up keeping both A and B in the results, even though they are duplicates of each other.

A better way to delete duplicates is to find communities within each connected component. This approach is used in https://github.com/src-d/gemini.
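
As a rough illustration of the first half of that idea, here is a minimal sketch that groups docs into connected components with union-find. It is not the current script and not gemini's code: the `pairs` input format, the `connected_components` helper, and the "keep the smallest id" rule are assumptions made for the example, and the community detection step inside each component is left out.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group doc ids into connected components using union-find.

    `pairs` is a hypothetical list of (doc_a, doc_b) near-duplicate pairs.
    """
    parent = {}

    def find(x):
        # Register unseen docs as their own root, then walk up to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Make the smaller id the root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Collect every doc under the root of its component.
    components = defaultdict(set)
    for doc in list(parent):
        components[find(doc)].add(doc)
    return list(components.values())


# A~B and B~C, but A and C were never matched directly.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(pairs)

# Assumed retention rule for the sketch: keep one doc (smallest id) per component.
kept = sorted(min(component) for component in components)
print(kept)
```

With this grouping, A, B, and C land in the same component, so A and B can no longer both survive. Keeping a single doc per component can remove too much, though, since C is not actually a duplicate of A; that is the case community detection within each component is meant to handle.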