AlDanial / cloc

cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.
GNU General Public License v2.0

Complex regular subexpression recursion limit (32766) exceeded #167

Closed Christoph-Harms closed 6 years ago

Christoph-Harms commented 7 years ago

When scanning a large project:

[Screenshot: bildschirmfoto 2017-02-18 um 13 08 40]

AlDanial commented 7 years ago

Rerun with verbose level 3 or higher, -v 3, to identify the file which causes problems. Ideally you'd send me a copy of this file for troubleshooting.

Christoph-Harms commented 7 years ago

Reran it with -v 3, but wasn't able to spot the error message in the output. Unfortunately, I can't send you any files, since it's a closed-source project :-/

AlDanial commented 7 years ago

Troubleshooting this will be a challenge in this case. One approach is to create a list file then pass that in with --list-file, then bisect the file until you've found the one(s) causing the problem.
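The bisection over a list file can be automated. Below is a minimal Python sketch, assuming the full file list triggers the warning and that the warning text contains "recursion limit". The names `cloc_warns` and `bisect_files` are hypothetical helpers, not part of cloc; only the `--list-file` option comes from the comment above. If several files are at fault, this finds just one of them.

```python
import os
import subprocess
import tempfile
from typing import Callable, Sequence

WARNING = "recursion limit"  # substring of the Perl warning quoted in this issue


def cloc_warns(files: Sequence[str]) -> bool:
    """Run cloc on the given files via --list-file; True if the warning appears."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
        fh.write("\n".join(files) + "\n")
        list_path = fh.name
    try:
        result = subprocess.run(
            ["cloc", f"--list-file={list_path}"],
            capture_output=True, text=True, check=False,
        )
    finally:
        os.unlink(list_path)
    # The warning normally goes to stderr; stdout is checked as well to be safe.
    return WARNING in result.stderr or WARNING in result.stdout


def bisect_files(files: Sequence[str],
                 warns: Callable[[Sequence[str]], bool] = cloc_warns) -> str:
    """Narrow a failing file list down to one file that triggers the warning."""
    candidates = list(files)
    if not candidates:
        raise ValueError("need at least one file")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Keep whichever half still reproduces the warning.
        candidates = first_half if warns(first_half) else candidates[mid:]
    return candidates[0]
```

Passing a different `warns` callable makes it possible to try the search logic without cloc installed.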

AlDanial commented 7 years ago

Regarding not seeing the warning message with -v 3, most likely it is because the warning is going to stderr and the regular output is going to stdout; if you redirected the output to a file the stderr content wouldn't appear there.

An easy way to fix that is to use the script program to capture all the output of a session. This is better than redirecting stderr to stdout via 2>&1 because that doesn't preserve the order of the outputs, while script's capture method does:

script debug_output.txt
cloc -v 2  ... # your additional args and inputs
exit           # to terminate script

then look for the recursion warning in debug_output.txt. That should pinpoint the file.
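Searching the captured log by hand gets tedious for large projects. Here is a small Python sketch that does the lookup, assuming the verbose output contains lines of the form `rm_comments file=<path> sub=<filter>` just before the warning, as in the log excerpts posted later in this thread. The function name is hypothetical.

```python
import re
from typing import List

# Matches e.g. "rm_comments file=/some/path.js sub=call_regexp_common"
FILTER_LINE = re.compile(r"rm_comments file=(?P<path>.+) sub=(?P<sub>\S+)")


def files_with_recursion_warning(log_text: str) -> List[str]:
    """Return the files whose comment filtering was followed by the warning."""
    culprits: List[str] = []
    current = None  # most recent file named on an rm_comments line
    for line in log_text.splitlines():
        match = FILTER_LINE.search(line)
        if match:
            current = match.group("path")
        elif "recursion limit" in line and current is not None:
            if current not in culprits:
                culprits.append(current)
    return culprits
```

Typical use would be `files_with_recursion_warning(open("debug_output.txt", errors="replace").read())`.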

Christoph-Harms commented 7 years ago

Doing that, I was able to pinpoint the error to... me. ;) Well, sort of. Turns out I left all 3rd party dependencies in the project folder before clocing. Sometimes when cloc hits a Gruntfile or gulpfile, it results in this error.

Example:

rm_comments file=/path/to/node_modules/is_js/Gruntfile.js sub=call_regexp_common
Complex regular subexpression recursion limit (32766) exceeded at /usr/local/Cellar/cloc/1.72/libexec/bin/cloc line 7573.

Does that make any sense to you?

AlDanial commented 7 years ago

That's great that you were able to nail it down to one file. Have you tried running cloc on just it?

cloc /path/to/node_modules/is_js/Gruntfile.js

Does that trigger the fault also? If yes, I sure would like a copy of this file. If you can't just post it, how about sanitizing it with dummy content, stripping out lines, etc, so long as the problem is still revealed?

dotchev commented 7 years ago

Had the same issue. Although I had deleted node_modules, it turned out that some of our test apps had their own node_modules. I then found this option, which solved the issue:

cloc --vcs=git .

this tells it to ignore all files as defined in .gitignore
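For projects that are not under git, a similar effect can be approximated by filtering the file list before handing it to cloc. The Python sketch below drops every path that has a `node_modules` component at any depth, which covers the nested case described above. The helper name and the default directory set are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def exclude_vendored(paths: Iterable[str],
                     vendored: Set[str] = frozenset({"node_modules"})) -> List[str]:
    """Keep only paths with no directory component named in `vendored`."""
    kept = []
    for path in paths:
        parts = PurePosixPath(path).parts
        # parts[:-1] are the directories; the last part is the file name.
        if not any(part in vendored for part in parts[:-1]):
            kept.append(path)
    return kept
```

The resulting list could be written to a file, one path per line, and passed to cloc with `--list-file`.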

T27S commented 7 years ago

I think I have a different repro case for this issue; scrubbed file: RecursionLimit.zip

The bug was triggered by an isolated change; a method call that was:

By.TagName("td")

was changed to:

By.XPath("./*")

Guess: the /* inside the string literal is being picked up as a comment open.

Log:

-> call_counter(....\Source/redacted.cs, C#)
-> read_file(....\Source/redacted.cs)
<- read_file
-> rm_blanks(language=C#)
-> remove_matches_2re(pattern=^\s$,\$)
<- remove_matches_2re
<- rm_blanks(language=C#)
-> rm_comments(file=....\Source/redacted.cs)
rm_comments file=....\Source/redacted.cs sub=call_regexp_common
-> call_regexp_common for C++
Complex regular subexpression recursion limit (32766) exceeded at script/cloc-1.72.pl line 9262.
<- call_regexp_common
-> remove_matches_2re(pattern=^\s$,\$)
<- remove_matches_2re
rm_comments file=....\Source/redacted.cs sub=remove_inline
-> remove_inline(pattern=//.$)
-> remove_matches_2re(pattern=^\s$,\$)
<- remove_matches_2re
<- rm_comments
<- call_counter(40, 9, 0)

(Unrelated: is it odd that call_regexp_common is for C++ on C# files?)

AlDanial commented 7 years ago

@T27S thanks for your investigation and posting your findings. Yes, cloc's inability to recognize comment markers within strings has been a limitation for a long time (or, the problem is cloc's reliance on Regexp::Common which has this limitation). I don't know a good way to solve that. For starters I'm going to investigate doing the Perl equivalent of a try/except around calls to Regexp::Common to at least directly identify the problem files.

Re: C#, as far as I know, it has the same comment syntax as C++.
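To illustrate the limitation discussed above, here is a toy Python sketch, not cloc's or Regexp::Common's actual implementation. It contrasts a naive block-comment regex with one that matches string literals first so comment markers inside them are left alone. It does not reproduce the Perl recursion warning; it only shows how a `/*` inside a string gets mistaken for a comment opener. The quoting rules are simplified, with no verbatim or raw strings.

```python
import re

# Naive approach: anything between /* and */ is a comment, strings ignored.
NAIVE_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# String-aware approach: try string literals first, then comments.
STRING = r'"(?:\\.|[^"\\])*"' + "|" + r"'(?:\\.|[^'\\])*'"
COMMENT = r"/\*.*?\*/|//[^\n]*"
TOKEN = re.compile(f"(?P<string>{STRING})|(?P<comment>{COMMENT})", re.DOTALL)


def strip_naive(source: str) -> str:
    """Remove block comments without looking at string literals."""
    return NAIVE_BLOCK_COMMENT.sub("", source)


def strip_string_aware(source: str) -> str:
    """Remove comments but copy string literals through untouched."""
    return TOKEN.sub(lambda m: m.group("string") or "", source)


SAMPLE = 'a = By.XPath("./*");\nb = 1; /* real comment */\n'
# strip_naive(SAMPLE) starts the "comment" inside the string and swallows
# the statement "b = 1;"; strip_string_aware(SAMPLE) keeps both statements.
```

A real fix would still need per-language string syntax, which is part of why this is hard to solve generally with regexes.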

frankdugan3 commented 7 years ago

I've just had a similar issue. The code that triggered the limit was a .js file with the following line:

rm('-rf', distPath + '/*')

If I remove the *, the issue disappears. At least the count isn't affected by this; it's the same with or without the error-causing character combo.

pombredanne commented 7 years ago

I did hit the same issue:

./cloc-1.72.pl --by-file --csv --csv-delimiter="|" linux-4.13-rc3/ --out=../linux-4.13-rc3-cloc.csv
   60545 text files.
   60002 unique files.                                          
Complex regular subexpression recursion limit (32766) exceeded at ./../bin/cloc-1.72.pl line 9262.

when counting the code for the 4.13-rc3 Linux kernel from https://git.kernel.org/torvalds/t/linux-4.13-rc7.tar.gz

Let me add a debug output and the exact file that triggered this

pombredanne commented 7 years ago

So with this file this can be reproduced nicely: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/arch/m68k/ifpsp060/src/isp.S?h=v4.13-rc3

./cloc-1.72.pl linux-4.13-rc3/arch/m68k/ifpsp060/src/isp.S
       1 text file.
       1 unique file.                              
Complex regular subexpression recursion limit (32766) exceeded at ./../bin/cloc-1.72.pl line 9262.
       0 files ignored.

github.com/AlDanial/cloc v 1.72  T=0.08 s (13.2 files/s, 56898.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Assembly                         1            672           1318           2309
-------------------------------------------------------------------------------

pombredanne commented 7 years ago

Yet, the total number of lines is indeed 4299 = 672 + 1318 + 2309

AlDanial commented 7 years ago

Thanks for simplifying my life by pointing directly to the problem file. I haven't investigated which lines within the file cause the problem but there's a simple fix--just change the sequence of comment filters for Assembly so that # lines are removed before calling the Regexp::Common::Comment C++ regex.

Commit 13e8f7b has this change, however, this is a game of whack-a-mole; just a matter of time before a file that has the opposite problem appears (that is, the new filter order causes the recursion limit warning and the previous one works fine).
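The order dependence can be shown with a toy example. The Python sketch below is only an illustration with made-up input and hypothetical filter functions, not cloc's actual Assembly filters. The same two filters give different results depending on which runs first, because a `/*` inside a `#` comment line is taken for a block-comment opener unless the `#` lines are removed beforehand.

```python
import re
from typing import Callable, List, Sequence

Filter = Callable[[str], str]


def remove_hash_lines(text: str) -> str:
    """Drop whole lines whose first non-blank character is '#'."""
    return "".join(
        line for line in text.splitlines(keepends=True)
        if not line.lstrip().startswith("#")
    )


def remove_block_comments(text: str) -> str:
    """Naively drop everything between /* and */ (not string- or line-aware)."""
    return re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)


def apply_filters(text: str, filters: Sequence[Filter]) -> str:
    """Apply comment filters in the given order."""
    for comment_filter in filters:
        text = comment_filter(text)
    return text


def code_lines(text: str) -> List[str]:
    """Return the non-blank lines that remain after filtering."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up input: the '#' comment happens to contain "/*".
SAMPLE = "# see /usr/include/*\nmove.l %d0,%d1\n/* block */\nrts\n"
```

With block comments removed first, the `move.l` line is swallowed along with the comment. With `#` lines removed first, both instructions survive. A file with `#` inside a block comment could favor the opposite order, which is the whack-a-mole concern.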

AlDanial commented 7 years ago

Re-open this issue if the problem comes up again.

pombredanne commented 7 years ago

@AlDanial Thanks ++

pombredanne commented 7 years ago

@AlDanial would it be worth adding a test for this, so it doesn't regress the other way if another problematic file shows up later?

zevlag commented 6 years ago

Can we reopen this please?

This file causes this error: https://github.com/PrismJS/prism/blob/v1.7.0/components/prism-kotlin.min.js

AlDanial commented 6 years ago

For reference, the line in prism-kotlin.min.js that causes the problem is 725, starts with <td id="LC1" class="blob-code blob-code-inner js-file-line">. cloc sees characters 741 and 742, which are /*, as the beginning of a comment which is never terminated.

cloc does not use grammars to parse the languages it counts; instead it uses relatively simple regexes. For this reason it is imperfect and vulnerable to failures such as this.

I'm open to suggestions on how to improve this situation.
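One way to pre-screen files for this pattern is to look for a `/*` that has no later `*/`. The Python sketch below is a hypothetical helper, not something cloc provides. Like the regex approach it is deliberately not string-aware, so it flags the same kind of input that trips up the comment regex. The returned offset is 0-based.

```python
from typing import Optional


def find_unterminated_block_comment(text: str) -> Optional[int]:
    """Return the 0-based offset of a '/*' with no closing '*/', else None."""
    position = 0
    while True:
        start = text.find("/*", position)
        if start == -1:
            return None          # no further comment openers
        end = text.find("*/", start + 2)
        if end == -1:
            return start         # opener that is never closed
        position = end + 2       # skip past this complete comment
```

Running this over each file in a tree and listing the ones that return an offset gives a candidate set to exclude or inspect before counting.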

lukaszgryglicki commented 6 years ago

It also happens for me when running on node_modules generated by yarn for https://github.com/cncf/landscape. This is bad because I wanted to compare the amount of code specific to this repo (from src and tools) with the amount of code in dependencies.

AlDanial commented 6 years ago

I don't understand the node_modules and yarn part. I cloned landscape and counted its src and tools directories without issue though:

/tmp/landscape> cloc tools/
      38 text files.
      38 unique files.                              
       1 file ignored.

github.com/AlDanial/cloc v 1.77  T=0.03 s (1447.2 files/s, 98213.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
JavaScript                      37            173            118           2220
-------------------------------------------------------------------------------
SUM:                            37            173            118           2220
-------------------------------------------------------------------------------
/tmp/landscape> cloc src/
      87 text files.
      87 unique files.                              
      10 files ignored.

github.com/AlDanial/cloc v 1.77  T=0.19 s (403.2 files/s, 211065.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
JSON                             3              0              0          37179
JavaScript                      70            215             76           2565
Sass                             2             34              0            442
YAML                             1              0              0            207
EJS                              1              4              9             78
HTML                             1              0              0             18
-------------------------------------------------------------------------------
SUM:                            78            253             85          40489
-------------------------------------------------------------------------------

Can you give more insight into how I can reproduce the exact setup you have?

lukaszgryglicki commented 6 years ago

Try running yarn first, so it will download all dependencies, and then run cloc on the node_modules directory.

cc5350 commented 6 years ago

I'm seeing this problem in a JavaScript file. I cooked up a sanitized version of the file that demos the error (see below). As in other instances mentioned before, it's the "/*" that causes the error:

Complex regular subexpression recursion limit (32766) exceeded at /usr/local/bin/cloc line 9998.

I'm using cloc 1.76 on a CentOS 6 VM.

x = function(config) {
  config.set({
    files: [
        'test/*.js'
      ]
    , folderol: []
    , fiddledeedee: [ 'helloworld' ]
    , mary: 0000
    , hada: true
    , littlelamb: 'YYYYYYYXXXXX'
    , autoWatch: false
    , browsers: [ 'FooBarJS' ]
    , attentionSpan: 60000
    , singleTon: true
  });
};

AlDanial commented 6 years ago

@cc5350 : I'm not able to get the Complex regular subexpression warning using this input and the current development version of cloc. My Ubuntu system uses Perl v5.22.1 -- which Perl is on your CentOS VM?

AlDanial commented 6 years ago

Ref #245, try using the new --strip-str-comments switch added with 01ace37 in the development branch.