AlDanial / cloc

cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.
GNU General Public License v2.0

Bug on --strip-str-comments #719

Closed: cchaniotaki closed this issue 1 year ago

cchaniotaki commented 1 year ago

When a file contains a string with // inside (e.g. http://, file://, etc.), cloc strips everything from the // onward because it treats it as the start of a comment (in languages where // begins a comment). For example, I ran it on the JDK sources, and the bug shows up in "TestFinalizerStatisticsEvent.java" (and in other files as well).

To reproduce, run: cloc ./jdk --skip-uniqueness --processes=8 --strip-comments=BAK --original-dir

The resulting .BAK files show the problem wherever the source contains a string with // in it.

Before removing comments

switch (overridingClass.getName()) {
    case TEST_CLASS_NAME: {
        Asserts.assertTrue(event.getString("codeSource").startsWith("file://"));
        foundTestClassName = true;
        break;
    }
    case TEST_CLASS_UNLOAD_NAME: {
        foundTestClassUnloadName = true;
        break;
    }
}

After removing comments

switch (overridingClass.getName()) {
    case TEST_CLASS_NAME: {
        Asserts.assertTrue(event.getString("codeSource").startsWith("file:
        foundTestClassName = true;
        break;
    }
    case TEST_CLASS_UNLOAD_NAME: {
        foundTestClassUnloadName = true;
        break;
    }
}

Languages in which I have noticed this problem: Java, C, C++, C#

AlDanial commented 1 year ago

Unfortunately there's no straightforward way (at least I can't think of one) to solve this with a regex-based tool such as cloc. Only a true language parser can make sense of what's in a string. cloc's method is approximate rather than perfect.
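To illustrate why string context matters, here is a minimal Python sketch that contrasts a naive regex with a character-by-character scan that tracks whether it is inside a quoted literal. The regex is a simplified stand-in, not cloc's actual pattern, and the function names `strip_naive` and `strip_string_aware` are hypothetical. The scan only handles `//` line comments and single-line quoted literals; block comments, multi-line strings, and raw strings are ignored, which is part of why a real parser is needed.

```python
import re

# Naive stand-in for a regex-based stripper (NOT cloc's actual pattern):
# delete everything from the first "//" to the end of the line.
NAIVE_LINE_COMMENT = re.compile(r"//.*$")


def strip_naive(line: str) -> str:
    return NAIVE_LINE_COMMENT.sub("", line).rstrip()


def strip_string_aware(line: str) -> str:
    """Remove a trailing // comment, skipping any // inside "..." or '...'."""
    quote = None  # quote character of the literal we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # literal ends here
        elif ch in "\"'":
            quote = ch  # literal starts here
        elif line.startswith("//", i):
            return line[:i].rstrip()  # a real comment: cut it off
        i += 1
    return line


line = 'Asserts.assertTrue(s.startsWith("file://")); // check the scheme'
print(strip_naive(line))         # cuts the line inside the string literal
print(strip_string_aware(line))  # keeps the literal, drops only the comment
```

The naive version truncates the line at `"file:`, which matches the damage shown in the report, while the scanning version removes only the trailing comment.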

Having said that, I can think of a work-around although it is painful and unlikely to be worth the trouble. You could try to find files with such problematic strings using something like

find ./jdk -type f \( -name "*.java" -o -name "*.cpp" \) | xargs grep -l -P '"(http|https|file)://"'

then use perl to modify the string by replacing the first / of the // with a space, i.e.,

perl -pi -e 's{"(http|https|file)://"}{"$1: /"}g'   list_of_files_matched_by_grep

then run cloc --strip-comments=BAK on this set and finally restore the missing / in this set:

perl -pi -e 's{"(http|https|file): /"}{"$1://"}g'   list_of_files_matched_by_grep

As I said though, that's pretty extreme. One would have to be quite desperate to bother with all that.
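The mask, strip, and restore round trip described above can be sketched in Python on a single in-memory string. This is only an illustration under stated assumptions: `strip_line_comments` is a hypothetical stand-in for the comment-stripping step (it is not cloc and does not use cloc's regexes), and `mask`/`unmask` are hypothetical names that mirror the two perl substitutions.

```python
import re

# Same patterns as the perl one-liners: a string literal that is exactly
# "http://", "https://" or "file://", and its masked form.
MASK = re.compile(r'"(http|https|file)://"')
UNMASK = re.compile(r'"(http|https|file): /"')


def mask(text: str) -> str:
    """Break up the // inside matching literals so it no longer looks like a comment."""
    return MASK.sub(r'"\1: /"', text)


def unmask(text: str) -> str:
    """Undo mask() by putting the original // back."""
    return UNMASK.sub(r'"\1://"', text)


def strip_line_comments(text: str) -> str:
    """Hypothetical stand-in for the stripping step (NOT cloc itself)."""
    return re.sub(r"[ \t]*//.*$", "", text, flags=re.MULTILINE)


source = 'ok = s.startsWith("file://"); // check the scheme\n'
print(unmask(strip_line_comments(mask(source))), end="")
```

Like the perl version, the pattern only matches literals that consist of the scheme alone; a literal such as "https://example.com" would not be masked and would still be truncated.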