This PR partially implements #431 and should be able to detect #265 and #442 (if the code segments are sufficiently large).
What is working?
Relatively fast detection of duplicate segments. On the test_submission A1 it takes around 70ms, compared to 98ms for the comment language check and 48ms for the Repeated Math Operation Check. It could be made even faster by tuning the StructuralHashCodeVisitor to produce fewer hashCode collisions, at the cost of missing some duplicates. For example, also hashing the names of all named elements reduces the time to only 12ms, but that is not an option, because the check should still detect duplicate code segments that differ only in the name of a variable.
Detecting duplicates ignoring names and comments
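To illustrate the idea of name-insensitive hashing, here is a minimal sketch on a toy AST. It is not the StructuralHashCodeVisitor from this PR: the `Node` class, `structuralHash` and `structurallyEqual` are hypothetical names made up for this example. The hash only looks at node kinds and child order, so two segments that differ only in identifiers land in the same bucket, and a slower exact comparison confirms the candidates.

```java
import java.util.List;

public class StructuralHashSketch {
    // Hypothetical toy AST node: `kind` is the syntactic category,
    // `name` is an identifier (may be null for unnamed nodes).
    static final class Node {
        final String kind;
        final String name;
        final List<Node> children;

        Node(String kind, String name, Node... children) {
            this.kind = kind;
            this.name = name;
            this.children = List.of(children);
        }
    }

    // Hashes only the shape of the tree (node kinds and child order).
    // Identifiers are deliberately left out, so renamed copies collide.
    static int structuralHash(Node node) {
        int hash = node.kind.hashCode();
        for (Node child : node.children) {
            hash = 31 * hash + structuralHash(child);
        }
        return hash;
    }

    // Slower exact comparison, used to confirm candidates that share a hash.
    static boolean structurallyEqual(Node a, Node b) {
        if (!a.kind.equals(b.kind) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!structurallyEqual(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Node first = new Node("Assignment", null,
                new Node("Variable", "total"), new Node("Literal", null));
        Node renamed = new Node("Assignment", null,
                new Node("Variable", "sum"), new Node("Literal", null));
        Node different = new Node("Invocation", null,
                new Node("Variable", "total"));

        System.out.println(structuralHash(first) == structuralHash(renamed));
        System.out.println(structurallyEqual(first, renamed));
        System.out.println(structurallyEqual(first, different));
    }
}
```

Including `name` in the hash would reduce collisions and therefore the number of exact comparisons, which matches the trade-off described above: faster, but renamed copies would no longer be found.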
What is missing?
[ ] Adjusting the required number of statements for detecting a code duplicate based on where it is found (e.g. in an if-else) and how many differences there are (e.g. a 1:1 copy vs. one where multiple variables differ)
[ ] Code where the type is almost the same (the example where List/Set was used, but Collection would be necessary in a helper method)
[x] Counting the number of parameters a potential helper method would require (code segments that need a lot of parameters to be refactored into a method should not be linted)
[x] Write a lot of tests and check for feasibility of creating a helper method.
[ ] (Suggest a helper method, difficult to implement, but might be worth it)
[x] Add some safeguards to check that the StructuralEquality works well. (by comparing it to the slower version in debug mode)
[ ] Remove the CPDLinter Code
[ ] Evaluate when a differing expression can be replaced by a parameter (technically always by doing something like if (paramIsTrue) { doA(); } else { doB(); })
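The last item can be sketched as follows. This is only an illustration of the flag-parameter idea mentioned there, not code from this PR. All method names (`describeAsSum`, `describeAsProduct`, `describe`) and the `useSum` parameter are hypothetical. Two segments that differ in a single expression are merged into one helper that selects the differing part with a boolean.

```java
public class FlagParameterSketch {
    // First duplicate segment: differs from the second only in `a + b`.
    static String describeAsSum(int a, int b) {
        StringBuilder out = new StringBuilder("result: ");
        out.append(a + b);
        return out.append('!').toString();
    }

    // Second duplicate segment: differs from the first only in `a * b`.
    static String describeAsProduct(int a, int b) {
        StringBuilder out = new StringBuilder("result: ");
        out.append(a * b);
        return out.append('!').toString();
    }

    // Merged helper: the differing expression is selected by a boolean flag.
    static String describe(int a, int b, boolean useSum) {
        StringBuilder out = new StringBuilder("result: ");
        if (useSum) {
            out.append(a + b);
        } else {
            out.append(a * b);
        }
        return out.append('!').toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(2, 3, true));  // same text as describeAsSum(2, 3)
        System.out.println(describe(2, 3, false)); // same text as describeAsProduct(2, 3)
    }
}
```

This transformation is always possible, but the resulting helper is not always easier to read than the duplicates, which is why the check still needs a rule for when such a difference should count as refactorable.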