dodona-edu / dolos

:detective: Source code plagiarism detection
https://dolos.ugent.be
MIT License

[experimental] Add ability to ignore template code or frequently occurring fingerprints #1524

Closed rien closed 1 month ago

rien commented 2 months ago

This PR adds the ability to ignore template code by manually specifying ignore files or by setting a maximum count or percentage of files a code fragment can occur in before it is ignored.
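The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the Dolos implementation: the function name `ignored_fingerprints`, its parameters, and the representation of each file as a set of integer fingerprints are all hypothetical. It also assumes that a fingerprint is ignored only when it occurs in more files than the configured maximum.

```python
from collections import Counter


def ignored_fingerprints(files, template=frozenset(), max_count=None, max_fraction=None):
    """Return the fingerprints to ignore (illustrative sketch, not the Dolos API).

    files:        dict mapping a file name to its set of fingerprints
    template:     fingerprints taken from a manually specified ignore file
    max_count:    maximum number of files a fingerprint may occur in
    max_fraction: the same maximum, expressed as a fraction of all files
    """
    # Count in how many files each fingerprint occurs (at most once per file).
    occurrences = Counter()
    for fingerprints in files.values():
        occurrences.update(set(fingerprints))

    # Collect the configured thresholds; the stricter one wins.
    limits = []
    if max_count is not None:
        limits.append(max_count)
    if max_fraction is not None:
        limits.append(max_fraction * len(files))

    ignored = set(template)
    if limits:
        limit = min(limits)
        # Assumption: a fingerprint is ignored only when it occurs in MORE
        # files than the limit allows.
        ignored |= {fp for fp, n in occurrences.items() if n > limit}
    return ignored


# Hypothetical submissions: fingerprint 1 occurs in all four files,
# fingerprint 2 in three of them, fingerprint 3 in two.
files = {
    "a.java": {1, 2, 3, 4},
    "b.java": {1, 2, 5, 6},
    "c.java": {1, 7, 8, 9},
    "d.java": {1, 2, 3, 10},
}

print(sorted(ignored_fingerprints(files, max_fraction=0.5)))  # [1, 2]
print(sorted(ignored_fingerprints(files, max_count=3)))       # [1]
print(sorted(ignored_fingerprints(files, template={3})))      # [3]
```

In this sketch the manual ignore file and the frequency threshold are simply combined as a union, which mirrors using both options together.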

Note: this feature is currently experimental. We're not yet convinced by the initial results and will be running more tests to see whether this functionality actually improves plagiarism reports.

This option is not yet available in the web server; we are still thinking about how to implement it (see #1535).

Closes #1213, #716, #1163

Meanwhile, the following changes have been made to the dolos, dolos-core, and dolos-lib npm packages:

API changes

CLI

Core

Lib

Experimental results

To observe the effects of ignoring template code, we've run Dolos on a recent case of plagiarism.

The cases with confirmed plagiarism are present in the baseline comparison with a high similarity of 79% and fall within one of the four clusters.

Across all configurations, these cases are present in the identified clusters. However, the similarities decrease as the -M option becomes more aggressive, and the other clusters vary a little.

Even with -M .25, the confirmed cases remain at the top of the highest-ranking submissions, and the comparison between them does not differ much.
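To illustrate why similarities drop while a plagiarised pair can still rank on top, here is a small worked example. It assumes a Dice-style similarity (shared fingerprints relative to the total in both files); the measure Dolos actually uses may differ, and the `similarity` function and the fingerprint values are hypothetical.

```python
def similarity(left, right, ignored=frozenset()):
    """Dice-style similarity over fingerprint sets (illustrative assumption)."""
    # Drop ignored fingerprints from both files before comparing.
    left = left - ignored
    right = right - ignored
    total = len(left) + len(right)
    if total == 0:
        return 0.0
    return 2 * len(left & right) / total


# Fingerprints 1 and 2 stand for template code shared by every submission.
template = {1, 2}

# A hypothetical plagiarised pair shares much more than the template.
copied_a = {1, 2, 3, 4, 5, 6, 7, 8}
copied_b = {1, 2, 3, 4, 5, 6, 9, 10}

# A hypothetical unrelated pair shares only the template.
honest_a = {1, 2, 11, 12}
honest_b = {1, 2, 13, 14}

print(similarity(copied_a, copied_b))                      # 0.75
print(round(similarity(copied_a, copied_b, template), 2))  # 0.67
print(similarity(honest_a, honest_b))                      # 0.5
print(similarity(honest_a, honest_b, template))            # 0.0
```

Under these assumptions, ignoring the shared fingerprints lowers every score, but the pair that shares more than the template loses far less than the pair that shares only the template.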

Baseline (no ignoring)

image

Ignore template code (-i boilerplate.java)

image

Ignore fingerprints occurring in 75% of files (-M .75)

image

Ignore fingerprints occurring in 50% of files (-M .50)

image

Ignore fingerprints occurring in 25% of files (-M .25)

image

Ignore template code AND fingerprints in 75% of files (-i boilerplate.java -M .75)

image