This PR adds the ability to ignore template code, either by manually specifying a file to ignore or by setting a maximum count or percentage of files a code fragment can occur in before it is ignored.
Note: this feature is currently experimental. We're not convinced of the initial results and will be performing more tests to see whether this functionality would actually improve plagiarism reports or not.
This option is not yet available in the web server, but we are considering how to implement it (see #1535).
Closes #1213, #716, #1163
The following changes have been made to the dolos, dolos-core, and dolos-lib npm packages:
API changes
CLI
Added a new option -i, --ignore <path> to ignore matches with that file in the analysis
Fixed the options -m, --max-fingerprint-count <integer> and -M, --max-fingerprint-percentage <fraction> to ignore matches if the code is present in more than that count/percentage of files
Core
FingerprintIndex can now ignore files, or fingerprints occurring in more than a specified number of files.
constructor: added optional argument maxFingerprintFileCount which can be used to set the maximum number of files a fingerprint can occur in before it is ignored.
new function addIgnoredFile(file: TokenizedFile): void can be used to ignore all the fingerprints in a file.
new function ignoredEntries(): Array<FileEntry> to retrieve all ignored files.
new functions getMaxFingerprintFileCount(): number and updateMaxFingerprintFileCount(maxFingerprintFileCount: number | undefined) to retrieve and update the maxFingerprintFileCount. An update is applied to the index immediately.
new function addIgnoredHashes(hashes: Array<Hash>) which can be used to manually ignore certain hashes.
interface FileEntry: added field ignored: Set<SharedFingerprints> to track ignored fingerprints and field isIgnored: boolean to indicate whether this file is an ignored file.
SharedFingerprint now has a boolean ignored to reflect whether this shared fingerprint is ignored or not.
new function includesFile(file: TokenizedFile): boolean to check whether this fingerprint is included in the given file.
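To illustrate the idea behind maxFingerprintFileCount and addIgnoredFile, here is a minimal, self-contained sketch. It is a toy model and not the dolos-core implementation: the class ToyFingerprintIndex, its methods addFile, isIgnored and sharedFingerprints, and the use of plain numbers as hashes are all hypothetical names chosen for this example.

```typescript
// Toy model only: all names below are hypothetical, not the dolos-core API.
type Hash = number;

class ToyFingerprintIndex {
  // hash -> names of the analyzed files that contain it
  private readonly occurrences = new Map<Hash, Set<string>>();
  // hashes collected from files registered as ignored (e.g. template code)
  private readonly ignoredHashes = new Set<Hash>();
  private readonly maxFingerprintFileCount: number | undefined;

  constructor(maxFingerprintFileCount?: number) {
    this.maxFingerprintFileCount = maxFingerprintFileCount;
  }

  // Register an analyzed file together with its fingerprint hashes.
  addFile(name: string, hashes: Hash[]): void {
    for (const hash of hashes) {
      let files = this.occurrences.get(hash);
      if (files === undefined) {
        files = new Set<string>();
        this.occurrences.set(hash, files);
      }
      files.add(name);
    }
  }

  // Ignore every fingerprint that occurs in the given (template) file.
  addIgnoredFile(hashes: Hash[]): void {
    for (const hash of hashes) {
      this.ignoredHashes.add(hash);
    }
  }

  // A hash is ignored if it comes from an ignored file, or if it occurs
  // in MORE than maxFingerprintFileCount analyzed files.
  isIgnored(hash: Hash): boolean {
    if (this.ignoredHashes.has(hash)) {
      return true;
    }
    const files = this.occurrences.get(hash);
    const count = files === undefined ? 0 : files.size;
    return (
      this.maxFingerprintFileCount !== undefined &&
      count > this.maxFingerprintFileCount
    );
  }

  // Hashes shared by at least two files that are not ignored.
  sharedFingerprints(): Hash[] {
    const shared: Hash[] = [];
    this.occurrences.forEach((files, hash) => {
      if (files.size >= 2 && !this.isIgnored(hash)) {
        shared.push(hash);
      }
    });
    return shared;
  }
}

const index = new ToyFingerprintIndex(2);
index.addFile("a.java", [1, 2, 3, 6]);
index.addFile("b.java", [1, 2, 4, 6]);
index.addFile("c.java", [1, 5]);
index.addIgnoredFile([2]); // hash 2 stands in for template code
const shared = index.sharedFingerprints();
```

In this example hash 1 occurs in three files, which exceeds the limit of two, and hash 2 comes from the ignored file, so only hash 6 remains as a shared fingerprint.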
Lib
Dolos class now has the option to ignore a file or to ignore fingerprints occurring in more than a specified number or percentage of files
The options maxFingerprintCount and maxFingerprintPercentage now have an effect (they were previously ignored): code matching in more than this count or percentage of files will be ignored
analyzePaths has an extra optional parameter ignore?: string which can be set to the path of the file to ignore
analyze has an extra optional parameter ignoredFile?: File which can be set to the File to ignore
Report class now has an extra function ignoredEntries(): Array<FileEntry> to retrieve the files that have been ignored
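The relation between a percentage limit and a file-count limit can be sketched as follows. This is an assumption about how such options could be combined, not a description of what dolos-lib does: the helper effectiveMaxFileCount is hypothetical, the percentage is assumed to be a fraction between 0 and 1 (as in -M .75), and letting an explicit count take precedence over the percentage is a choice made for this example only.

```typescript
// Hypothetical helper, not part of dolos-lib.
// Returns the largest number of files a fingerprint may occur in
// before it is ignored, or undefined when no limit is configured.
function effectiveMaxFileCount(
  fileCount: number,
  maxFingerprintCount?: number,
  maxFingerprintPercentage?: number, // assumed to be a fraction in [0, 1]
): number | undefined {
  if (maxFingerprintCount !== undefined) {
    // Assumption: an explicit count wins over a percentage.
    return maxFingerprintCount;
  }
  if (maxFingerprintPercentage !== undefined) {
    // A fingerprint is ignored when it occurs in MORE than this fraction
    // of the files, so rounding down keeps the boundary case included.
    return Math.floor(maxFingerprintPercentage * fileCount);
  }
  return undefined;
}

const limitFor40Files = effectiveMaxFileCount(40, undefined, 0.75);
const limitFor10Files = effectiveMaxFileCount(10, undefined, 0.25);
```

With 10 files and a fraction of 0.25, the limit becomes 2: a fingerprint occurring in 3 files (30%) exceeds the limit and is ignored, while one occurring in 2 files (20%) is kept.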
Experimental results
To observe the effects of ignoring template code, we've run Dolos on a recent case of plagiarism.
In the baseline comparison, the cases with confirmed plagiarism have a high similarity (79%) and appear in one of the four clusters.
Across all configurations, these cases remain in the identified clusters. However, their similarity decreases as the -M option becomes more aggressive, and the other clusters vary slightly.
Even with -M .25, the confirmed cases are at the top of the highest-ranking submissions, and the comparison between them does not change much.
Baseline (no ignoring)
Ignore template code (-i boilerplate.java)
Ignore fingerprints occurring in more than 75% of files (-M .75)
Ignore fingerprints occurring in more than 50% of files (-M .50)
Ignore fingerprints occurring in more than 25% of files (-M .25)
Ignore template code AND fingerprints occurring in more than 75% of files (-i boilerplate.java -M .75)