checkstyle / patch-filters

Suppression Filter for Checkstyle that is based on patch file
GNU Lesser General Public License v2.1
4 stars 6 forks source link

FUTURE: Gather ideas for column strategy #336

Open rdiachenko opened 4 years ago

rdiachenko commented 4 years ago

Column strategy should be able to distinguish between existing and new violations on the changed code lines and show new violations only. Current patchedline strategy can't understand what exactly was changed in the line (e.g. a whitespace was added between words), so it may show old violations as well.

Investigate ways to implement a column strategy. One of them can be usage of https://github.com/google/diff-match-patch to compare an old and a new line and get the columns within those lines with a new text.

rdiachenko commented 1 year ago

Here're few more unverified ideas to check and play with.

Try to detect location of changed code more accurate

Given a patch we can try to obtain previous code version by reverting that patch. In git it can be done via git apply -R <patch>. Having old and new code we can apply Meyer's or Histogram diff algorithms (either from the currently used org.eclipse.jgit library or java-diff-utils and find start/end position of changed code within a line of code. Those positions are columnNo in Checkstyle terminology which can be use for more accurate violation filtration.

In the org.eclipse.jgit changes are represented as a list of org.eclipse.jgit.diff.Edit instances. Each Edit instance contains start / end lines of code blocks which were updated:

// start / end lines in code block before patch
int beginA;
int endA;

// start / end lines in code block after patch
int beginB;
int endB;

This information can potentially be used in detecting start / end columnNo within a line marking new code or replaced one.

We can also try to configure diff algorithms mentioned above to skip whitespace characters. However, whitespace related checks may be affected. We can also consider a size of changed code block, if it's one line or few character change it probably makes sense to just skip such outliers and ignore new violations.

Use AST trees to accept only new violations

Given AST tree of the current code we can try to obtain an AST tree for the code before patch. Having old and new trees we can try to accept only new violations by comparing both trees.

Another option to investigate is to run Checkstyle second time on the old AST tree and just get a diff between two runs.

This option won't work on input files which are not represented as AST trees.

Do we need this in Checkstyle?

Considering Checkstyle is just a library one may write a plugin to gradle / maven / etc. or just a CI script to run Checkstyle twice (on code before and after patch) and fail if diff is not empty. May not be a viable option if codebase is huge, however, it may still be faster than running all the tests.

Another option is to cache Checkstyle execution on master / main and compare with new execution on pathed code in a different branch.

rdiachenko commented 1 year ago

Existing attempts to solve the problem:

rdiachenko commented 1 year ago

Checking the following idea.

Do we need this in Checkstyle?

  1. Find a diff between main and current branch
    
    [main]$ git checkout -b feature

Apply changes to feature and commit them

[feature]$ git diff $(git merge-base main feature) feature > feature.patch


`git merge-base` gives the most recent common commit between the branches.

Using `origin/main` should help in case local main is outdated.

2. List all modified files

[feature]$ git diff --name-only $(git merge-base main feature) \ feature > modified-files.txt


3. Apply the patch in reverse

[feature]$ git apply -R feature.patch


4. Run Checkstyle against modified files

[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \ -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-main.txt


5. Apply the patch

[feature]$ git apply feature.patch


6. Run Checkstyle against modified files

[feature]$ java -jar checkstyle/checkstyle-10.9.3-all.jar \ -c checkstyle/config.xml $(cat modified-files.txt) > checkstyle-feature.txt


7. Compare Checkstyle executions

[feature]$ comm -13 <(sort checkstyle-main.txt) <(sort checkstyle-feature.txt)



#### Todo
1. Need to compare Checkstyle runs in a more clever way. Some existing violations may change their position which gives false postitives.
2. How to generate Checkstyle report for new violations only?
romani commented 1 year ago

Only comparison of xpath on violation , can help with detection of new or very different by structure same violation ( ok to keep as new). We have feature in CLI to print xpath on violations, it can help.

Comparison by line and column will be too unreliable.

We might end up in state to figure out algorithm how to compare two xpath and consider them the same.