Closed eltfshr closed 10 months ago
I agree with your assessment. Thanks for providing a simple case to demonstrate the problem. I've been contemplating reworking the diff logic for a while and this issue convinces me the time has come to do it. It's a bit of an undertaking so will take a while.
I finally understand what's going on. The first step in cloc's process to diff two code files is to categorize each line in both files as being either code or comment. Lines with both code and comments are counted as code lines.
Categorization works like this: first the comment language filters are applied to a file to strip all comments. Then a diff algorithm (sdiff
from the Algorithm::Diff
Perl module) is used to compare the original file with the version stripped of comments. Lines that have been removed are comments and everything else is code.
Your original 1.java
file is
// case SET: nrPerformed = 0; //r.getInteger(0, 5); break; // case ANALYZE: // nrPerformed = 0; //r.getInteger(0, 3); // break;
and without comments is
nrPerformed = 0; break;
If you save the no-comment version to a file, for example 1.java.nc
and use a command line diff tool (diff
, vimdiff
, xxdiff
, meld
, etc) to compare 1.java
with 1.java.nc
you'll see the problem. To make this explicit, here's the output of sdiff
:
» sdiff 1.java 1.java.nc // case SET: | nrPerformed = 0; nrPerformed = 0; //r.getInteger(0, 5); < break; break; // case ANALYZE: < // nrPerformed = 0; //r.getInteger(0, 3); < // break; <
In other words, the first line of 1.java
, a comment, is falsely paired with the first line of 1.java.nc
, which is code. I'm unaware of any diff algorithm that is smart enough to pair the second line of 1.java
with the first line of 1.java.nc
.
This incorrect pairing makes cloc think the first line of 1.java
is code rather than comment, hence the false result that one line of code has changed.
If the in-line comment were preserved in the comment removal step, ie 1.java.nc
had nrPerformed = 0; //r.getInteger(0, 5);
instead of just nrPerformed = 0;
then the results would be computed correctly.
It would be a non-trivial effort to add new filter options to only remove pure comment lines. Perhaps you can think of a better algorithm to distinguish code lines from comment lines?
Short of a full rewrite, I can't think of a workable solution to solve this problem. I'll document the as a limitation that --diff
cannot be 100% reliable.
Describe the bug cloc counts comments incorrectly as codes when counting diffs between 2 java files. I added comments into 3 existing comment lines, but cloc counts one of them as code.
cloc; OS; OS version
To Reproduce I reduce the files to small snippets here: 1.java
2.java
Then count diffs with
docker run --rm -v ${PWD}:/tmp aldanial/cloc --diff 1.java 2.java
:Expected result I expect the modified comment to be 3 and modified code to be 0.
Additional context
Testing to see which line is counted as code:
remove first comment from 2.java:
result:
remove second comment from 2.java:
result:
remove third comment from 2.java:
result:
it seems like cloc counts the first comment as code.
if I remove the first line completely from both files: 1.java:
2.java:
result:
cloc now treats the removed line as comment, but still count another comment as code.