blingenf / copydetect

Code plagiarism detection tool
MIT License
226 stars 37 forks source link

copydetect

Screenshot of copydetect code comparison output

Overview

Copydetect is a code plagiarism detection tool based on the approach proposed in "Winnowing: Local Algorithms for Document Fingerprinting" and used for the popular MOSS platform. Copydetect takes a list of directories containing code as input, and generates an HTML report displaying copied slices as output. The implementation takes advantage of fast numpy functions for efficient generation of results. Code tokenization is handled by Pygments, so all 500+ languages which pygments can detect and tokenize are in turn supported by copydetect.

Note that, like MOSS, copydetect is designed to detect likely instances of plagiarism; it is not guaranteed to catch cheaters dedicated to evading it, and it does not provide a guarantee that plagiarism has occurred.

Installation

Copydetect can be installed using pip install copydetect. Note that Python version 3.7 or greater is required. You can then generate a report using the copydetect command (copydetect.exe on Windows. If your scripts folder is not in your PATH the code can also be run using py.exe -m copydetect).

Usage

The simplest usage is copydetect -t DIRS, where DIRS is a space-separated list of directories to search for input files. This will recursively search for all files in the provided directories and compare every file with every other file. To look only at specific file extensions, use -e followed by another space-separated list (for example, copydetect -t student_code -e cc cpp h)

If the files you want to compare to are different from the files you want to check for plagiarism (for example, if you want to also compare to submissions from previous semesters), use -r to provide a list of reference directories. For example, copydetect -t PA01_F20 -r PA01_F20 PA01_S20 PA01_F19. To avoid matches with code that was provided to students, use -b to specify a list of directories containing boilerplate code.

There are several options for tuning the sensitivity of the detector. The noise threshold, set with -n, is the minimum number of matching characters between two documents that is considered plagiarism. Note that this is AFTER tokenization and filtering, where variable names have been replaced with V, function names with F, etc. If you change -n (default value: 25), you will also have to change the guarantee threshold, -g (default value: 30). This is the number of matching characters for which the detector is guaranteed to detect the match. If speed isn't an issue, you can set this equal to the noise threshold. Finally, the display threshold, -d (default value: 0.33), is used to determine what percentage of code similarity is considered interesting enough to display on the output report. The distribution of similarity scores is plotted on the output report to assist selection of this value.

There are several other command line options for different use cases. If you only want to check for "lazy" plagiarism (direct copying without changing variable names or reordering code), -f can be used to disable code filtering. If you don't want to compare files in the same leaf directory (for example, if code is split into per-student directories and you don't care about self plagiarism), use -l. For a complete list of configuration options, see the following section.

Configuration Options

Configuration options can be provided either by using the command line arguments or by using a JSON file. If a JSON file is used, specify it on the command line using -c (e.g., copydetect -c configuration.json). A sample configuration file is available here. The following list provides the names of each JSON configuration key along with its associated command line arguments.

Advanced options:

API

Copydetect can also be run via the python API. An example of basic usage is provided below. API documentation is available here.

>>> from copydetect import CopyDetector
>>> detector = CopyDetector(test_dirs=["tests"], extensions=["py"],
...                         display_t=0.5)
>>> detector.add_file("copydetect/utils.py")
>>> detector.run()
  0.00: Generating file fingerprints
   100%|████████████████████████████████████████████████████| 8/8
  0.31: Beginning code comparison
   100%|██████████████████████████████████████████████████| 8/8
  0.31: Code comparison completed
>>> detector.generate_html_report()
Output saved to report/report.html