Project-Fungus / fungus-cli

Command-line plagiarism detection tool for ARMv7 assembly.
MIT License

fungus-cli

FUNGUS is a tool for detecting similarities between ARMv7 assembly projects, such as submissions for introductory software assignments. This is the command-line tool, which performs the analysis and generates a plagiarism report in JSON format. It is meant to be used in conjunction with a desktop GUI, such as fungus-gui.

FUNGUS is inspired by Stanford's Measure of Software Similarity (Moss). At its core, it uses the same algorithm, winnowing, described in this paper.

Installation

Binary

  1. Go to the Releases page.
  2. Download the artifact for your platform.
  3. Add the command to your PATH.

Run fungus --version to check that the installation was successful.

Building From Source

  1. Clone this repository or download the code.
  2. Ensure you have installed Cargo.
  3. Run cargo build --release. The binary will be placed in the target/release/ directory.

Key Inputs

Root

FUNGUS assumes the projects to analyze are all in separate directories, each a direct child of the same root directory. For example, consider the following directory structure:

submissions/
├── project1
│   ├── subdir1
│   │   └── file1.s
│   └── subdir2
│       └── file2.s
├── project2
│   ├── code1.s
│   └── code2.s
└── starter-code
    ├── file1.s
    └── file2.s

If the submissions/ directory is selected as the root, then FUNGUS will select project1, project2, and starter-code as the projects to compare.
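The selection rule above can be sketched in a few lines of Python. This is only an illustration, not FUNGUS's actual implementation: the function name discover_projects is hypothetical, and restricting the search to files ending in .s is an assumption based on the example tree.

```python
from pathlib import Path


def discover_projects(root):
    """Map each direct child directory of `root` to the .s files inside it.

    Hypothetical helper: files sitting directly in `root` are skipped, and
    only the `.s` extension is considered (an assumption for this sketch).
    """
    projects = {}
    for child in sorted(Path(root).iterdir()):
        if child.is_dir():
            # Search the project recursively, so nested subdirectories count.
            projects[child.name] = sorted(
                p.relative_to(root).as_posix() for p in child.rglob("*.s")
            )
    return projects
```

Called on the submissions/ directory from the example, this returns one entry each for project1, project2, and starter-code, with files in nested subdirectories attributed to their top-level project.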

Starter Code

Paths to ignore (e.g., assignment starter code provided to all students) can be given as input to FUNGUS. Any code in students' projects that matches this code will not be flagged as potential plagiarism. The paths to ignore can be inside the root directory (as in the example above) or outside of it.
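One plausible way to ignore starter code in a fingerprint-based detector is to fingerprint the ignored files as well and discard any project fingerprint whose hash also appears there. Whether FUNGUS works exactly this way is an assumption. The function name and the (hash, position) pair layout below are hypothetical.

```python
def remove_starter_fingerprints(project_fingerprints, starter_fingerprints):
    """Drop fingerprints whose hash also occurs in the starter code.

    Both arguments are lists of (hash, position) pairs. This layout is an
    assumption made for the sketch, not FUNGUS's internal representation.
    """
    ignored = {h for h, _ in starter_fingerprints}
    return [(h, pos) for h, pos in project_fingerprints if h not in ignored]
```

With this approach, code copied from the starter files never contributes to a match, no matter how many projects contain it.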

Tokenizer

Two tokenizers are available:

Noise Threshold, Guarantee Threshold, and Max Token Offset

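FUNGUS accepts noise and guarantee thresholds as inputs. In winnowing, matches shorter than the noise threshold are ignored, while matches at least as long as the guarantee threshold are guaranteed to be detected.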
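To show how the two thresholds interact, here is a minimal sketch of winnowing as described in the original paper: with noise threshold k and guarantee threshold t, every run of k tokens is hashed, and the minimum hash in each window of t - k + 1 consecutive hashes is kept as a fingerprint. The function names, the use of CRC32 as the hash, and measuring both thresholds in tokens are assumptions for illustration, not details taken from FUNGUS.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (CRC32 chosen for the sketch)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes, window):
    """Keep the minimum hash of each window, preferring the rightmost on ties.

    Each selected position is recorded once, as a (hash, position) pair.
    Inputs shorter than one window yield no fingerprints.
    """
    fingerprints = []
    last = -1
    for start in range(len(hashes) - window + 1):
        w = hashes[start:start + window]
        smallest = min(w)
        # Position of the rightmost occurrence of the minimum in this window.
        idx = start + (window - 1 - w[::-1].index(smallest))
        if idx != last:
            fingerprints.append((hashes[idx], idx))
            last = idx
    return fingerprints


def fingerprint(tokens, noise_threshold, guarantee_threshold):
    """Fingerprint a token list using window size t - k + 1."""
    if guarantee_threshold < noise_threshold:
        raise ValueError("guarantee threshold must be >= noise threshold")
    window = guarantee_threshold - noise_threshold + 1
    return winnow(kgram_hashes(tokens, noise_threshold), window)
```

Two token sequences that share a run of at least guarantee_threshold tokens contain an identical window of hashes, so they end up with at least one fingerprint hash in common. Shared runs shorter than noise_threshold never form a full k-gram and therefore cannot match.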

When using the "relative" tokenizer, a max token offset can also be specified. This is the maximum distance that a relative token can record. Intuitively, choosing a very small max offset will probably result in many false positives: in the extreme case of a max offset of 0, this reduces to non-relative lexing with no distinction between registers, labels, etc. Conversely, choosing a very large max offset will probably result in many false negatives: with no limit at all, the results depend on the overall structure of the document, and there is no guarantee that any matches will be reported unless two files are identical.
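The effect of the max offset can be illustrated with a toy relative tokenizer. The encoding used here is an assumption made for the sketch and may differ from what FUNGUS does: every token that is not a known opcode is treated as an identifier and replaced by the distance back to its previous occurrence, capped at max_offset, with 0 for a first occurrence. The function name and the opcode set are hypothetical.

```python
def relative_tokens(tokens, opcodes, max_offset=None):
    """Replace identifiers with the capped distance to their last occurrence.

    Assumed encoding (not taken from FUNGUS): opcodes are kept verbatim,
    every other token becomes ("ID", distance). Pass max_offset=None for
    no limit.
    """
    last_seen = {}
    out = []
    for i, tok in enumerate(tokens):
        if tok in opcodes:
            out.append(tok)
            continue
        distance = i - last_seen[tok] if tok in last_seen else 0
        if max_offset is not None:
            distance = min(distance, max_offset)
        out.append(("ID", distance))
        last_seen[tok] = i
    return out
```

Under this encoding, consistently renaming registers or labels leaves the token stream unchanged, which is what makes renaming ineffective as a disguise. A max offset of 0 maps every identifier to the same token, and an unlimited offset makes each token sensitive to code inserted far away, in line with the two extremes described above.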

Output Format

{
    "warnings": [
        {
            "file": "project1/my_invalid_file.s",
            "message": "Message explaining what's wrong.",
            "warn_type": "Type"
        }
    ],
    "project_pairs": [
        {
            "project1": "Project 1",
            "project2": "Project 2",
            "matches": [
                {
                    "project_1_location": {
                        "file": "Project 1/code.s",
                        "span": {
                            "start": 0,
                            "end": 42
                        }
                    },
                    "project_2_location": {
                        "file": "Project 2/my_code.s",
                        "span": {
                            "start": 100,
                            "end": 150
                        }
                    }
                }
            ]
        }
    ]
}
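A report in this shape can be consumed with any JSON library. The following sketch assumes the field names shown in the example above and ranks project pairs by the total span length matched in the first project. The function name summarize_report is hypothetical, and the sketch makes no assumption about the unit in which spans are measured.

```python
import json


def summarize_report(report_text):
    """Return (project1, project2, match_count, matched_length) per pair.

    Pairs are sorted by matched length in project 1, largest first.
    Field names follow the example report above.
    """
    report = json.loads(report_text)
    rows = []
    for pair in report["project_pairs"]:
        matches = pair["matches"]
        matched_length = sum(
            m["project_1_location"]["span"]["end"]
            - m["project_1_location"]["span"]["start"]
            for m in matches
        )
        rows.append((pair["project1"], pair["project2"], len(matches), matched_length))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

For the example report, this yields a single row for "Project 1" and "Project 2" with one match of length 42.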

Note that: