cchaniotaki closed 1 year ago
Unfortunately there's no straightforward way (at least I can't think of one) to solve this with a regex-based tool such as cloc. Only a true language parser can make sense of what's in a string. cloc's method is approximate rather than perfect.
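To illustrate why a pattern-only approach goes wrong here, the following is a minimal Python sketch, not cloc's actual code. It compares a naive `//` stripper with a small scanner that tracks whether it is inside a double-quoted string. The function names and the sample line are made up for illustration, and the scanner deliberately ignores block comments, character literals, and multi-line strings, which a real parser would have to handle.

```python
import re


def strip_naive(line: str) -> str:
    """Remove everything from the first // onward, ignoring string context."""
    return re.sub(r"//.*", "", line).rstrip()


def strip_string_aware(line: str) -> str:
    """Remove a // comment only when the // lies outside a double-quoted string."""
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif line.startswith("//", i):
            return line[:i].rstrip()
        i += 1
    return line


# Hypothetical input line, not taken from the jdk sources.
sample = 'String url = "http://example.com"; // default endpoint'
print(strip_naive(sample))         # the string literal is cut off after "http:
print(strip_string_aware(sample))  # the string literal survives intact
```

Even this small scanner needs state that a single substitution cannot carry, which is the gap between an approximate counter and a true language parser.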
Having said that, I can think of a work-around, although it is painful and unlikely to be worth the trouble. You could try to find files with such problematic strings using something like
find ./jdk -type f \( -name "*.java" -o -name "*.cpp" \) | xargs grep -l -P '"(http|https|file)://"'
then use perl to modify the string by replacing the first / with a space, i.e.,
perl -pi -e 's{"(http|https|file)://"}{"$1: /"}g' list_of_files_matched_by_grep
then run cloc --strip-comments=BAK on this set and finally restore the missing / in this set:
perl -pi -e 's{"(http|https|file): /"}{"$1://"}g' list_of_files_matched_by_grep
As I said though, that's pretty extreme. One would have to be quite desperate to bother with all that.
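For anyone who does want to try it, the same mask, strip, restore idea can be sketched in Python. This is only an in-memory illustration under stated assumptions: the placeholder text, the function names, and the sample code are hypothetical, `strip_line_comments` is a crude stand-in for the cloc step, and the string-literal pattern is itself regex-based, so it can be fooled by quotes inside comments or by character literals.

```python
import re

# Hypothetical marker; it must not occur anywhere in the real sources.
PLACEHOLDER = "@@SLASHSLASH@@"

# A double-quoted literal on one line, allowing backslash escapes.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def mask_slashes(source: str) -> str:
    """Replace // inside double-quoted string literals with the placeholder."""
    return STRING_LITERAL.sub(
        lambda m: m.group(0).replace("//", PLACEHOLDER), source
    )


def strip_line_comments(source: str) -> str:
    """Stand-in for the comment-stripping step (cloc in the real workflow)."""
    return re.sub(r"[ \t]*//.*$", "", source, flags=re.MULTILINE)


def restore_slashes(source: str) -> str:
    """Put the original // back once the comments are gone."""
    return source.replace(PLACEHOLDER, "//")


def strip_preserving_strings(source: str) -> str:
    """Mask, strip, then restore, in that order."""
    return restore_slashes(strip_line_comments(mask_slashes(source)))


# Hypothetical input, not taken from the jdk sources.
code = 'String a = "file://tmp"; // local path\nint n = 2; // count\n'
print(strip_line_comments(code))       # string literal is truncated
print(strip_preserving_strings(code))  # string literal is preserved
```

Unlike the perl one-liners, this masks any `//` inside a string rather than only the exact strings "http://", "https://", and "file://", but it shares the same weakness: it is still a heuristic rather than a parser.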
When a file contains a string with // inside (e.g. http://, file://, etc.), cloc removes that text because it treats the // as the start of a comment (in languages where // starts a comment). For example, I ran it on the jdk; the bug shows up in the file "TestFinalizerStatisticsEvent.java" (and in other files as well).
To reproduce, run:
cloc ./jdk --skip-uniqueness --processes=8 --strip-comments=BAK --original-dir
In the resulting .BAK files you will see the problem wherever a string contains //.
Before removing comments
After removing comments
Languages in which I have noticed this problem: Java, C, C++, C#