AlDanial / cloc

cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.
GNU General Public License v2.0
19.78k stars 1.02k forks source link

Incorrect diffs count between code and comment #774

Closed eltfshr closed 10 months ago

eltfshr commented 1 year ago

Describe the bug cloc counts comments incorrectly as codes when counting diffs between 2 java files. I added comments into 3 existing comment lines, but cloc counts one of them as code.

cloc; OS; OS version

To Reproduce I reduce the files to small snippets here: 1.java

        // case RESET_ROLE:
        // case SET:
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE:
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

2.java

        // case RESET_ROLE: comment
        // case SET: comment
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE: comment
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

Then count diffs with docker run --rm -v ${PWD}:/tmp aldanial/cloc --diff 1.java 2.java:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              3              1
 modified                        1              0              2              1
-------------------------------------------------------------------------------

Expected result I expect the modified comment to be 3 and modified code to be 0.

Additional context

Testing to see which line is counted as code:

remove first comment from 2.java:

        // case RESET_ROLE:
        // case SET: comment
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE: comment
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

result:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              3              2
 modified                        1              0              2              0
-------------------------------------------------------------------------------

remove second comment from 2.java:

        // case RESET_ROLE: comment
        // case SET:
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE: comment
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

result:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              4              1
 modified                        1              0              1              1
-------------------------------------------------------------------------------

remove third comment from 2.java:

        // case RESET_ROLE: comment
        // case SET: comment
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE:
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

result:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              4              1
 modified                        1              0              1              1
-------------------------------------------------------------------------------

it seems like cloc counts the first comment as code.

if I remove the first line completely from both files: 1.java:

        // case SET:
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE:
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

2.java:

        // case SET: comment
            nrPerformed = 0; //r.getInteger(0, 5);
            break;
        // case ANALYZE: comment
        //     nrPerformed = 0; //r.getInteger(0, 3);
        //     break;

result:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java
 same                            0              0              3              1
 modified                        1              0              1              1
-------------------------------------------------------------------------------

cloc now treats the removed line as comment, but still count another comment as code.

AlDanial commented 1 year ago

I agree with your assessment. Thanks for providing a simple case to demonstrate the problem. I've been contemplating reworking the diff logic for a while and this issue convinces me the time has come to do it. It's a bit of an undertaking so will take a while.

AlDanial commented 11 months ago

I finally understand what's going on. The first step in cloc's process to diff two code files is to categorize each line in both files as being either code or comment. Lines with both code and comments are counted as code lines. Categorization works like this: first the comment language filters are applied to a file to strip all comments. Then a diff algorithm (sdiff from the Algorithm::Diff Perl module) is used to compare the original file with the version stripped of comments. Lines that have been removed are comments and everything else is code.

Your original 1.java file is

    // case SET:
        nrPerformed = 0; //r.getInteger(0, 5);
        break;
    // case ANALYZE:
    //     nrPerformed = 0; //r.getInteger(0, 3);
    //     break;

and without comments is

        nrPerformed = 0;
        break;

If you save the no-comment version to a file, for example 1.java.nc and use a command line diff tool (diff, vimdiff, xxdiff, meld, etc) to compare 1.java with 1.java.nc you'll see the problem. To make this explicit, here's the output of sdiff:

» sdiff 1.java 1.java.nc
// case SET:                                  |     nrPerformed = 0;
    nrPerformed = 0; //r.getInteger(0, 5);    <
    break;                                          break;
// case ANALYZE:                              <
//     nrPerformed = 0; //r.getInteger(0, 3); <
//     break;                                 <

In other words, the first line of 1.java, a comment, is falsely paired with the first line of 1.java.nc, which is code. I'm unaware of any diff algorithm that is smart enough to pair the second line of 1.java with the first line of 1.java.nc.

This incorrect pairing makes cloc think the first line of 1.java is code rather than comment, hence the false result that one line of code has changed.

If the in-line comment were preserved in the comment removal step, ie 1.java.nc had nrPerformed = 0; //r.getInteger(0, 5); instead of just nrPerformed = 0; then the results would be computed correctly.

It would be a non-trivial effort to add new filter options to only remove pure comment lines. Perhaps you can think of a better algorithm to distinguish code lines from comment lines?

AlDanial commented 10 months ago

Short of a full rewrite, I can't think of a workable solution to solve this problem. I'll document the as a limitation that --diff cannot be 100% reliable.