cs50 / compare50

This is compare50, a fast and extensible plagiarism-detection tool.
GNU General Public License v3.0

In match_#.html, identical lines sometimes aren't highlighted (in exact or verbatim) #73

Open dmalan opened 4 years ago

Jelleas commented 4 years ago

The current comparison methods don't do line-by-line comparison, but use k-grams (contiguous sequences of tokens) instead. In this model, short identical lines can fail to match each other: the lines themselves are below the noise threshold (the length of the k-gram), and the surrounding tokens on the next/previous line don't match.
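To make that concrete, here is a minimal sketch of the effect. It is not compare50's code: the tokenizer, the `kgrams` helper, and the value `K = 8` are all assumptions picked for illustration. The line `int foo = 3;` is identical in both snippets, but it is only 5 tokens long, and the longest run of tokens the two snippets share around it is 7, so no 8-gram is common to both.

```python
import re

def tokenize(line):
    # Crude illustrative tokenizer: identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", line)

def kgrams(tokens, k):
    # All contiguous runs of k tokens; empty if there are fewer than k tokens.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

K = 8  # hypothetical noise threshold, chosen only for this example

a = ["x = read();", "int foo = 3;", "print(x);"]
b = ["y = load(1);", "int foo = 3;", "emit(y, 2);"]

# k-grams run over the whole token stream, ignoring line boundaries.
tokens_a = [t for line in a for t in tokenize(line)]
tokens_b = [t for line in b for t in tokenize(line)]

line_tokens = tokenize("int foo = 3;")
shared_at_k = kgrams(tokens_a, K) & kgrams(tokens_b, K)
shared_at_5 = kgrams(tokens_a, 5) & kgrams(tokens_b, 5)

print(len(line_tokens))                   # number of tokens in the identical line
print(len(shared_at_k))                   # shared k-grams at the threshold K
print(tuple(line_tokens) in shared_at_5)  # whether a smaller k would catch the line
```

Lowering k would catch the line, but at the cost of many more incidental matches, which is the noise the threshold is there to suppress.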

It is possible, and perhaps an interesting idea, to create an alternative comparison method that does verbatim line-by-line comparison. That might make it easier to spot similar variable names in relatively short declarations such as int foo = 3. But perhaps there should be some fuzzy matching on longer lines too, otherwise it'd be easy to escape this method of comparison 😉
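A rough sketch of what such a method could look like, using only the standard library. Nothing here exists in compare50: `matching_lines`, `fuzzy_min_len`, and `fuzzy_ratio` are hypothetical names and thresholds. Short lines must match exactly after whitespace normalization; longer lines may also match when `difflib.SequenceMatcher` rates them as nearly identical.

```python
from difflib import SequenceMatcher

def normalize(line):
    # Collapse runs of whitespace so indentation and spacing don't matter.
    return " ".join(line.split())

def matching_lines(a_lines, b_lines, fuzzy_min_len=30, fuzzy_ratio=0.9):
    """Return (i, j) index pairs of lines considered matching.

    Exact match for any non-empty line; fuzzy match only when both
    lines are at least fuzzy_min_len characters long.
    """
    norm_a = [normalize(line) for line in a_lines]
    norm_b = [normalize(line) for line in b_lines]
    matches = []
    for i, la in enumerate(norm_a):
        if not la:
            continue  # skip blank lines
        for j, lb in enumerate(norm_b):
            if la == lb:
                matches.append((i, j))
            elif (min(len(la), len(lb)) >= fuzzy_min_len
                  and SequenceMatcher(None, la, lb).ratio() >= fuzzy_ratio):
                matches.append((i, j))
    return matches

a = ["int foo = 3;", "total = compute_sum(values, count) + offset;"]
b = ["int  foo = 3;", "total = compute_sum(values, count) + offsets;"]
pairs = matching_lines(a, b)
print(pairs)
```

This compares every line against every other line, which is quadratic in the number of lines; that is probably acceptable for a small number of already-flagged pairs, but not as a first-pass method across all submissions.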

dmalan commented 4 years ago

Hm, so on the back end the k-grams make sense for efficiency, but once we have a pairwise match, couldn't we do a thorough comparison, a la diff, to show everything? Much less costly to do for ~50 pairs?

Jelleas commented 4 years ago

The difficult thing with a diff is alignment. The code may be shuffled a bit, and odds are similar pieces of code appear in different parts of submissions. And it's probably only those pieces that a diff is at all interesting for.

For text/exact this idea of a diff would be tricky, because these methods only find matches that have no differences. The structure method, however, is rather robust at finding similar pieces of code, and this gives us an entry point for aligning those pieces of code to then create a diff.
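As a sketch of that last step, assuming the structure pass has already handed us a pair of aligned line ranges: the `(start, end)` span format and the `diff_matched_region` helper below are hypothetical, not compare50's actual data model. Only `difflib.unified_diff` is a real standard-library call.

```python
from difflib import unified_diff

def diff_matched_region(a_lines, b_lines, span_a, span_b):
    """Diff only the regions that another comparison method already aligned.

    span_a and span_b are hypothetical (start, end) line-index pairs,
    end-exclusive, pointing into a_lines and b_lines.
    """
    a_start, a_end = span_a
    b_start, b_end = span_b
    return list(unified_diff(
        a_lines[a_start:a_end],
        b_lines[b_start:b_end],
        fromfile="a", tofile="b",
        lineterm="",  # inputs carry no trailing newlines
        n=1,          # one line of context around each change
    ))

a_lines = [
    "#include <stdio.h>",
    "int total = 0;",
    "for (int i = 0; i < n; i++)",
    "    total += v[i];",
    "return total;",
]
b_lines = [
    "int main(void) {",
    "int n = 4;",
    "int sum = 0;",
    "for (int i = 0; i < n; i++)",
    "    sum += v[i];",
]

# Pretend the structure method matched a[1:4] with b[2:5].
result = diff_matched_region(a_lines, b_lines, (1, 4), (2, 5))
print("\n".join(result))
```

Because the diff is restricted to regions that are already known to correspond, the alignment problem for shuffled code is sidestepped: the renamed variable stands out while the identical loop header shows up as unchanged context.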

One idea I've had then is a "per-match view", where the view can take the matches in code from one comparison method (realistically structure), and overlay the information from other comparison methods or from a diff. Essentially this lets you zoom in on details within similar code. And hopefully quickly point out gotchas within that code. In my mind, we could then easily introduce specialized comparison methods (for naming for instance), and use that method to point out oddities in already similar snippets of code, rather than the entire submission.

dlloyd09 commented 3 years ago

FWIW, I believe this came up in our most recent review of cases, again. Not sure if further thoughts have been made on how to potentially resolve, just a data point!