boyter / scc

Sloc, Cloc and Code: scc is a very fast accurate code counter with complexity calculations and COCOMO estimates written in pure Go
MIT License
6.48k stars 254 forks source link

Verbatim strings produce incorrect results #175

Open NickHackman opened 4 years ago

NickHackman commented 4 years ago

Describe the bug Incorrect behavior when dealing with verbatim strings in languages such as C++, Python, and Rust

To Reproduce

Python:

text = r"xyz\\"
# hi

Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Python                       1         2        0         0        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         2        0         0        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

C++:

/* 46 lines 37 code 3 comments 6 blanks */

#include <stdio.h>

// bubble_sort_function
void bubble_sort(int a[10], int n) {
  int t;
  int j = n;
  int s = 1;
  while (s > 0) {
    s = 0;
    int i = 1;
    while (i < j) {
      if (a[i] < a[i - 1]) {
        t = a[i];
        a[i] = a[i - 1];
        a[i - 1] = t;
        s = 1;
      }
      i++;
    }
    j--;
  }
}

int main() {
  int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};
  int n = 10;
  int i = 0;

  printf(R"(Before sorting:\n\n" )");
  // Single line comment
  while (i < n) {
    printf("%d ", a[i]);
    i++;
  }

  bubble_sort(a, n);

  printf("\n\nAfter sorting:\n\n");
  i = 0;
  while (i < n) {
    printf("%d ", a[i]);
    i++;
  }
}

This is the example Tokei uses. Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                          1        46        4         2       40          3
───────────────────────────────────────────────────────────────────────────────
Total                        1        46        4         2       40          3
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $919
Estimated Schedule Effort 1.076760 months
Estimated People Required 0.101203
───────────────────────────────────────────────────────────────────────────────

Rust:

// 41 lines 33 code 3 comments 5 blanks

/* /**/ */
fn main() {
    let start = r##"/*##\"
\"##;
    // comment
    loop {
        if x.len() >= 2 && x[0] == '*' && x[1] == '/' { // found the */
            break;
        }
    }
}

fn foo<'a, 'b>(name: &'b str) {
    let this_ends = "a \"test/*.";
    call1();
    call2();
    let this_does_not = /* a /* nested */ comment " */
        "*/another /*test
            call3();
            */";
}

fn foobar() {
    let does_not_start = // "
        "until here,
        test/*
        test"; // a quote: "
    let also_doesnt_start = /* " */
        "until here,
        test,*/
        test"; // another quote: "
}

fn foo() {
    let a = 4; // /*
    let b = 5;
    let c = 6; // */
}

This is the example Tokei uses. Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                         1        41        3         6       32          0
───────────────────────────────────────────────────────────────────────────────
Total                        1        41        3         6       32          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $727
Estimated Schedule Effort 0.985035 months
Estimated People Required 0.087520
───────────────────────────────────────────────────────────────────────────────

Expected behavior Python: 2 lines 1 comment 1 code 0 blanks C++: 46 lines 3 comment 37 code 6 blanks Rust: 41 lines 33 code 3 comments 5 blanks

This is the default case, as far as I know for C++ verbatim strings, because C++ verbatim strings can have prefixes.

Rust supports the pattern of r(#+)"..."(#+) where the capture groups must have equal length.

Desktop (please complete the following information):

Love the project :heart:

boyter commented 4 years ago

Thanks for reporting. I think it should be just a matter of adding them to the verbatim string logic, but the test cases make it much easier to verify.

boyter commented 4 years ago

Man that c++ file is screwy...

# bb100123 @ VCOANSYD256197 in ~ [11:35:34]
$ gcc test.cpp
test.cpp:31:10: error: use of undeclared identifier 'R'
  printf(R"(Before sorting:\n\n" )");
         ^
test.cpp:31:35: warning: missing terminating '"' character [-Winvalid-pp-token]
  printf(R"(Before sorting:\n\n" )");
                                  ^
1 warning and 1 error generated.

# bb100123 @ VCOANSYD256197 in ~ [11:35:45] C:1
$ clang test.cpp
test.cpp:31:10: error: use of undeclared identifier 'R'
  printf(R"(Before sorting:\n\n" )");
         ^
test.cpp:31:35: warning: missing terminating '"' character [-Winvalid-pp-token]
  printf(R"(Before sorting:\n\n" )");
                                  ^
1 warning and 1 error generated.

Is there some trick to get that to work? The issue I see with it is,

printf(R"(Before sorting:\n\n" )");

Where it looks like the string ends after the \n\n as its not escaped?

NickHackman commented 4 years ago

@boyter Huh, the code didn't have any sort of warnings or compile time errors when I checked them

g++ (GCC) 10.1.0
clang version 10.0.0 

Raw strings were added in C++11, if that helps in anyway.

In Tokei, since we can't really cover the prefix being anything without bringing in the entire regex crate, only handles the base case in the above example. Instead of checking for a closing ", the new closing quote needs to be )".

I think adding

"C++": {
    "complexitychecks": [
      "for ",
      "for(",
      "if ",
      "if(",
      "switch ",
      "while ",
      "else ",
      "|| ",
      "&& ",
      "!= ",
      "== "
    ],
    "extensions": [
      "cc",
      "cpp",
      "cxx",
      "c++",
      "pcc"
    ],
    "line_comment": [
      "//"
    ],
    "multi_line": [
      [
        "/*",
        "*/"
      ]
    ],
    "quotes": [
      {
        "end": "\"",
        "start": "\""
      },
      {
        "end": ")\"",
        "start": "\"(",
        "ignoreEscape": true,
      }
    ]
  },

Ideally Scc instead of interpreting the end and closing as simply as Tokei does, to use regular expressions to be able to handle the more abstract and crazy cases of C++ Raw Strings (small performance hit, but the regex is simple so most likely minimal), then storing the prefix in the state machine as the closing syntax of the string literal. This would also allow Scc to handle the (iirc infinite) possible #s for the Rust raw strings as well.

Maybe an additional field in quotes stating whether the end and start field are regular expressions to optimize that a bit. :smile:

boyter commented 4 years ago

Yeah that would explain it. Hmmm makes it a bit harder to deal with then.

I may have to go back and redo the whole matching engine actually since its rather complex and fragile at this point. That might tie into some performance gains to be gotten from a specific state machine for the more common languages staying out of that logic where possible.

dwmcrobb commented 3 years ago

There's a similar issue with literal characters. A contrived trivial example:

#include <iostream>

int main(int argc, char *argv[])
{
  std::cout << '"';
  /*  block comment
      inside code */
}

//----------------------------------------------------------------------------
//!
//----------------------------------------------------------------------------
/*  
 *
 *
 *
 */
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                          1        17        1         0       16          0
───────────────────────────────────────────────────────────────────────────────
Total                        1        17        1         0       16          0
───────────────────────────────────────────────────────────────────────────────

The example is contrived, but it's fairly common to see this sort of thing in C++ code.

boyter commented 3 years ago

Yeah there are a heap of edge cases with this. I think that as part of the performance tweaks I am considering having bespoke parser's to speedup and catch these edge cases might be the best solution.