cs50 / compare50

This is compare50, a fast and extensible plagiarism-detection tool.
GNU General Public License v3.0

show uniqueness of each match #36

Open benedictbrown opened 4 years ago

benedictbrown commented 4 years ago

For each matching area, show how many files contain that match ("2 files" means just the two files currently being compared), similar to Etector. Etector gives the details in a tooltip and also gives a larger font size to more unique matches. The former is very helpful, but the latter can make comparing files difficult.

Knowing how unique a match is helps a lot in deciding which cases to refer and in articulating to the committee how improbable the similarity is.

cmlsharp commented 4 years ago

Agreed this would be a good feature. Interestingly, we do actually do this to some extent behind the scenes, but at present it isn't expressed to the user.

In the ranking phase, we keep a frequency map of the k-grams we've seen and use it to do a modified inverse document frequency type thing when computing the score for two submissions. So rarity is considered for the ranking.
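A rough illustration of the idea, not compare50's actual code: the sketch below works on raw characters rather than tokens, and the names `kgrams`, `idf_scores`, and the `log(n / df)` weighting are assumptions made for the example. It counts how many submissions contain each k-gram, then scores a pair by summing the rarity of the k-grams they share.

```python
import math
from collections import Counter


def kgrams(text, k):
    """Return the set of distinct k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def idf_scores(submissions, k=5):
    """Score each pair of submissions by the summed rarity of shared k-grams."""
    grams = {name: kgrams(text, k) for name, text in submissions.items()}
    # Document frequency: how many submissions contain each k-gram.
    df = Counter(g for s in grams.values() for g in s)
    n = len(submissions)
    names = sorted(grams)
    scores = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = grams[a] & grams[b]
            # A k-gram present in every submission contributes log(1) = 0.
            scores[(a, b)] = sum(math.log(n / df[g]) for g in shared)
    return scores


# Hypothetical submissions: all three share boilerplate, only a and b share more.
subs = {
    "a": "import os\nxyzzy_secret",
    "b": "import os\nxyzzy_secret",
    "c": "import os\nprint(1)",
}
scores = idf_scores(subs)
```

Here the pair `("a", "b")` gets a positive score from the k-grams only those two contain, while `("a", "c")` scores zero because everything they share also appears in every other submission.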

The slight trouble with displaying this to the user is that in the more intensive comparing phase (which is only run for the top n ranked pairs), we actually expand the matching k-grams. So if a particular k-gram occurs in two submissions and some surrounding characters happen to match in both documents too, we absorb those into the match. While we know the frequency of the original k-gram, we don't know the frequency of the expansion that we end up showing the user. We could sort of fudge it and pretend they're the same, but that seems a bit dicey.
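To make the mismatch concrete, here is a small sketch (again character-based and purely illustrative; `expand` and the sample strings are hypothetical, not compare50 internals). A seed k-gram is grown in both directions while the two files agree, and the number of files containing the seed differs from the number containing the expanded region.

```python
def expand(a, b, i, j, k):
    """Grow the match a[i:i+k] == b[j:j+k] outward while characters agree."""
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + k, j + k
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return a[start_a:end_a]


# Hypothetical file contents.
files = {
    "a": "s = 0; total += x * 2; end",
    "b": "t = 1; total += x * 2; fin",
    "c": "total = 0",
}
seed = "total"
expanded = expand(files["a"], files["b"],
                  files["a"].index(seed), files["b"].index(seed), len(seed))

# Frequency of the seed k-gram versus frequency of what the user would see.
seed_count = sum(seed in text for text in files.values())
expanded_count = sum(expanded in text for text in files.values())
```

In this example the seed appears in all three files, but the expanded region appears in only two, so reporting the seed's frequency would understate how unique the displayed match is. Getting the true number would mean searching every submission for each expanded region.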

benedictbrown commented 4 years ago

Etector may be facing a similar issue--its tooltips show a matching selection that is sometimes different from what is highlighted in the text. So likely it is showing the uniqueness of k-grams.