blingenf / copydetect

Code plagiarism detection tool
MIT License
224 stars 36 forks source link

Feature Request: Ignore n levels of "leafs" #51

Open incaseoftrouble opened 4 months ago

incaseoftrouble commented 4 months ago

Hi!

I am using copydetect together with DOMjudge submissions. Effectively, the situation is that students can submit attempts at a problem multiple times. I want to check for plagiarism across students, but not within their own attempts. Here, ignore_leaf comes into play, since I can organize submissions by problem / student / <student attempts>. But if one attempt could contain multiple files, I'd like to organize submissions by problem / student / submission / <submission files>. Then, ignore_leaf doesn't help much. (Of course, I could still flatten all the student submissions into one directory, but I think there is an "easier" way)

Suggestion: Instead of making ignore_leaf a boolean flag, make it an integer, and two files are not compared to each other when the n-th parent is the same.

This should be easily doable in https://github.com/blingenf/copydetect/blob/d3667ff110e5911eb57b162535f18b1cb6e2f8c2/copydetect/detector.py#L422-L429 I can provide a PR if you like!

blingenf commented 4 months ago

Thanks for the suggestion -- I'd like to see if there's an easy way with argparse to allow ignore_leaf to be used both as a boolean parameter and with an optional numeric argument to control the leaf depth. That way any existing usages won't be affected.

PRs are always welcome! I will get around to this myself eventually if you don't have the time to do a PR but I have unfortunately been very slow with this project recently.

incaseoftrouble commented 4 months ago

I'd like to see if there's an easy way with argparse to allow ignore_leaf to be used both as a boolean parameter and with an optional numeric argument to control the leaf depth. That way any existing usages won't be affected.

I think not (a following positional argument could not be distinguished from the parameter), but I would maybe suggest something like

ignore_group = parser.add_mutually_exclusive_group()
ignore_group.add_argument("--ignore_leaf", action="store_const", dest="ignore_depth", const=1)
ignore_group.add_argument("--ignore_depth", dest="ignore_depth", type=int)

this should work. I can PR this soon enough. The biggest question is how properly name this other parameter :)


However, there also is the question whether equivalence shouldn't also be possible "top down". In the sense that two files are not compared if the first n components of their (relative) paths are the same. This is the same thing if all submissions are on the same level, but suppose that students submit several java projects, which could have varying depth.

blingenf commented 4 months ago

I think that's a good solution. I think ignore_depth is a perfectly fine name as long as the help string properly describes what it does.

You bring up a good point about whether it should look top down or bottom up, though -- it actually makes more sense to me to go from the top down (in this case to the depth of the student folders). How many subfolders there are within each of the student folders shouldn't matter.

Is there a scenario where a bottom-up search would work but a top-down one would not? I guess there could always be some kind of --ignore_direction parameter, but I think with three different parameters for controlling this behavior it might confuse more people than it would help.

incaseoftrouble commented 4 months ago

Is there a scenario where a bottom-up search would work but a top-down one would not?

I can't really imagine one. But that doesn't say that much :)

I guess there could always be some kind of --ignore_direction parameter, but I think with three different parameters for controlling this behavior it might confuse more people than it would help.

in case both sides should be supported, I think the best variant would be to have three mutually exclusive options like --ignore-after, --ignore-leaves, and --ignore-depth.

I'd go ahead and draft a PR. The options could still be removed / hidden if needed.

EDIT: Ok, I looked at this a bit, and top-down needs more restructuring. The problem is that this "ignore the first n" should probably be relative to the base directory from which this file has been picked up. This means that the file list needs to track where the file came from (and what happens if one file effectively appears in multiple directories).

I guess that this path-based reasoning should be with logical paths only, but fingerprinting etc. is cached by real path (path.resolve().absolute()). So there would be one object per "logical path" (= base directory + path of the file within the base directory) with a many-to-one relationship to the physical file object (which would handle the fingerprints etc.).

I can take a look at this if you want, but we should agree beforehand whether these more involved changes are ok.

blingenf commented 3 months ago

Hmm -- I think this would also run into trouble with the file adding API. I'm not sure how many people are using the API instead of the command-line interface but files added using, e.g., detector.add_file("copydetect/utils.py") don't necessarily have a base directory. I would be fine with a solution which takes just the test_dirs / ref_dirs arguments and performs some bookkeeping as you describe to track the base directory for each file, with manually added files using the working directory as the base directory. But the way files are tracked is definitely not designed for the top-down approach and I could see this taking a lot more effort than expected.

The bottom-up search should be much easier to implement and should handle most use cases.

incaseoftrouble commented 3 months ago

One more elegant way would be to separate testing from defining pairs. So that if you add thing with testing and reference directories etc., a set of pairs to compare is created, and the actual file comparison then is performed on this set.

But I agree that bottom up probably is good enough for now. I think we can leave this issue open. In case someone has a need for this feature, they could comment here.


Compromise suggestion: What about adding this separation into "obtain the comparison pairs" and "perform comparison"? Then one could add an API like add_comparison_pair(file_one, file_two) so that it would be reasonably easy to add this directly, if needed.

blingenf commented 3 months ago

I agree that generating a set of comparison pairs would be a nice solution for the current design, though it may conflict with the possibility of having a "one to many" comparison where similarity could be computed between a single test file and the reference files in aggregate (this has been requested in multiple issues so it would probably be useful to implement eventually).

At some point I'd like to rethink the architecture and do a more complete refactor -- I wrote the original version of this package in a week or two for a pretty simple use-case. I just don't personally use the package anymore (I haven't been in academia for a few years now) so it's hard to have the motivation :slightly_smiling_face: