Open rdiachenko opened 4 years ago
Here're few more unverified ideas to check and play with.
Given a patch we can try to obtain previous code version by reverting that patch. In git it can be done via git apply -R <patch>
. Having old and new code we can apply Meyer's
or Histogram
diff algorithms (either from the currently used org.eclipse.jgit
library or java-diff-utils
and find start/end position of changed code within a line of code. Those positions are columnNo
in Checkstyle terminology which can be use for more accurate violation filtration.
In the org.eclipse.jgit
changes are represented as a list of org.eclipse.jgit.diff.Edit
instances. Each Edit
instance contains start / end lines of code blocks which were updated:
// start / end lines in code block before patch
int beginA;
int endA;
// start / end lines in code block after patch
int beginB;
int endB;
This information can potentially be used in detecting start / end columnNo
within a line marking new code or replaced one.
We can also try to configure diff algorithms mentioned above to skip whitespace characters. However, whitespace related checks may be affected. We can also consider a size of changed code block, if it's one line or few character change it probably makes sense to just skip such outliers and ignore new violations.
Given AST tree of the current code we can try to obtain an AST tree for the code before patch. Having old and new trees we can try to accept only new violations by comparing both trees.
Another option to investigate is to run Checkstyle second time on the old AST tree and just get a diff between two runs.
This option won't work on input files which are not represented as AST trees.
Considering Checkstyle is just a library one may write a plugin to gradle / maven / etc. or just a CI script to run Checkstyle twice (on code before and after patch) and fail if diff is not empty. May not be a viable option if codebase is huge, however, it may still be faster than running all the tests.
Another option is to cache Checkstyle execution on master / main and compare with new execution on pathed code in a different branch.
Existing attempts to solve the problem:
newline
strategy, doesn't solve problem mentioned in this issueChecking the following idea.
[main]$ git checkout -b feature
feature
and commit them[feature]$ git diff $(git merge-base main feature) feature > feature.patch
`git merge-base` gives the most recent common commit between the branches.
Using `origin/main` should help in case local main is outdated.
2. List all modified files
[feature]$ git diff --name-only $(git merge-base main feature) \ feature > modified-files.txt
3. Apply the patch in reverse
[feature]$ git apply -R feature.patch
4. Run Checkstyle against modified files
[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \ -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-main.txt
5. Apply the patch
[feature]$ git apply feature.patch
6. Run Checkstyle against modified files
[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \ -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-feature.txt
7. Compare Checkstyle executions
[feature]$ comm -13 <(sort checkstyle-main.txt) <(sort checkstyle-feature.txt)
#### Todo
1. Need to compare Checkstyle runs in a more clever way. Some existing violations may change their position which gives false postitives.
2. How to generate Checkstyle report for new violations only?
Only comparison of xpath on violation , can help with detection of new or very different by structure same violation ( ok to keep as new). We have feature in CLI to print xpath on violations, it can help.
Comparison by line and column will be too unreliable.
We might end up in state to figure out algorithm how to compare two xpath and consider them the same.
Column strategy should be able to distinguish between existing and new violations on the changed code lines and show new violations only. Current
patchedline
strategy can't understand what exactly was changed in the line (e.g. a whitespace was added between words), so it may show old violations as well.Investigate ways to implement a column strategy. One of them can be usage of https://github.com/google/diff-match-patch to compare an old and a new line and get the columns within those lines with a new text.