Verbatim strings produce incorrect results

Describe the bug Incorrect behavior when dealing with verbatim strings in languages such as C++, Python, and Rust

To Reproduce

Python:

text = r"xyz\\"
# hi

Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Python                       1         2        0         0        2          0
───────────────────────────────────────────────────────────────────────────────
Total                        1         2        0         0        2          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $39
Estimated Schedule Effort 0.325843 months
Estimated People Required 0.014395
───────────────────────────────────────────────────────────────────────────────

C++:

/* 46 lines 37 code 3 comments 6 blanks */

#include <stdio.h>

// bubble_sort_function
void bubble_sort(int a[10], int n) {
  int t;
  int j = n;
  int s = 1;
  while (s > 0) {
    s = 0;
    int i = 1;
    while (i < j) {
      if (a[i] < a[i - 1]) {
        t = a[i];
        a[i] = a[i - 1];
        a[i - 1] = t;
        s = 1;
      }
      i++;
    }
    j--;
  }
}

int main() {
  int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};
  int n = 10;
  int i = 0;

  printf(R"(Before sorting:\n\n" )");
  // Single line comment
  while (i < n) {
    printf("%d ", a[i]);
    i++;
  }

  bubble_sort(a, n);

  printf("\n\nAfter sorting:\n\n");
  i = 0;
  while (i < n) {
    printf("%d ", a[i]);
    i++;
  }
}

This is the example Tokei uses. Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                          1        46        4         2       40          3
───────────────────────────────────────────────────────────────────────────────
Total                        1        46        4         2       40          3
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $919
Estimated Schedule Effort 1.076760 months
Estimated People Required 0.101203
───────────────────────────────────────────────────────────────────────────────

Rust:

// 41 lines 33 code 3 comments 5 blanks

/* /**/ */
fn main() {
    let start = r##"/*##\"
\"##;
    // comment
    loop {
        if x.len() >= 2 && x[0] == '*' && x[1] == '/' { // found the */
            break;
        }
    }
}

fn foo<'a, 'b>(name: &'b str) {
    let this_ends = "a \"test/*.";
    call1();
    call2();
    let this_does_not = /* a /* nested */ comment " */
        "*/another /*test
            call3();
            */";
}

fn foobar() {
    let does_not_start = // "
        "until here,
        test/*
        test"; // a quote: "
    let also_doesnt_start = /* " */
        "until here,
        test,*/
        test"; // another quote: "
}

fn foo() {
    let a = 4; // /*
    let b = 5;
    let c = 6; // */
}

This is the example Tokei uses. Output:

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                         1        41        3         6       32          0
───────────────────────────────────────────────────────────────────────────────
Total                        1        41        3         6       32          0
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $727
Estimated Schedule Effort 0.985035 months
Estimated People Required 0.087520
───────────────────────────────────────────────────────────────────────────────

Expected behavior Python: 2 lines 1 comment 1 code 0 blanks C++: 46 lines 3 comment 37 code 6 blanks Rust: 41 lines 33 code 3 comments 5 blanks

This is the default case, as far as I know for C++ verbatim strings, because C++ verbatim strings can have prefixes.

Rust supports the pattern of r(#+)"..."(#+) where the capture groups must have equal length.

Desktop (please complete the following information):

OS: Linux
Version: 2.12.0

Love the project :heart:

Thanks for reporting. I think it should be just a matter of adding them to the verbatim string logic, but the test cases make it much easier to verify.

Man that c++ file is screwy...

# bb100123 @ VCOANSYD256197 in ~ [11:35:34]
$ gcc test.cpp
test.cpp:31:10: error: use of undeclared identifier 'R'
  printf(R"(Before sorting:\n\n" )");
         ^
test.cpp:31:35: warning: missing terminating '"' character [-Winvalid-pp-token]
  printf(R"(Before sorting:\n\n" )");
                                  ^
1 warning and 1 error generated.

# bb100123 @ VCOANSYD256197 in ~ [11:35:45] C:1
$ clang test.cpp
test.cpp:31:10: error: use of undeclared identifier 'R'
  printf(R"(Before sorting:\n\n" )");
         ^
test.cpp:31:35: warning: missing terminating '"' character [-Winvalid-pp-token]
  printf(R"(Before sorting:\n\n" )");
                                  ^
1 warning and 1 error generated.

Is there some trick to get that to work? The issue I see with it is,

printf(R"(Before sorting:\n\n" )");

Where it looks like the string ends after the \n\n as its not escaped?

@boyter Huh, the code didn't have any sort of warnings or compile time errors when I checked them

g++ (GCC) 10.1.0
clang version 10.0.0

Raw strings were added in C++11, if that helps in anyway.

In Tokei, since we can't really cover the prefix being anything without bringing in the entire regex crate, only handles the base case in the above example. Instead of checking for a closing ", the new closing quote needs to be )".

I think adding

"C++": {
    "complexitychecks": [
      "for ",
      "for(",
      "if ",
      "if(",
      "switch ",
      "while ",
      "else ",
      "|| ",
      "&& ",
      "!= ",
      "== "
    ],
    "extensions": [
      "cc",
      "cpp",
      "cxx",
      "c++",
      "pcc"
    ],
    "line_comment": [
      "//"
    ],
    "multi_line": [
      [
        "/*",
        "*/"
      ]
    ],
    "quotes": [
      {
        "end": "\"",
        "start": "\""
      },
      {
        "end": ")\"",
        "start": "\"(",
        "ignoreEscape": true,
      }
    ]
  },

Ideally Scc instead of interpreting the end and closing as simply as Tokei does, to use regular expressions to be able to handle the more abstract and crazy cases of C++ Raw Strings (small performance hit, but the regex is simple so most likely minimal), then storing the prefix in the state machine as the closing syntax of the string literal. This would also allow Scc to handle the (iirc infinite) possible #s for the Rust raw strings as well.

Maybe an additional field in quotes stating whether the end and start field are regular expressions to optimize that a bit. :smile:

Yeah that would explain it. Hmmm makes it a bit harder to deal with then.

I may have to go back and redo the whole matching engine actually since its rather complex and fragile at this point. That might tie into some performance gains to be gotten from a specific state machine for the more common languages staying out of that logic where possible.

There's a similar issue with literal characters. A contrived trivial example:

#include <iostream>

int main(int argc, char *argv[])
{
  std::cout << '"';
  /*  block comment
      inside code */
}

//----------------------------------------------------------------------------
//!
//----------------------------------------------------------------------------
/*  
 *
 *
 *
 */

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                          1        17        1         0       16          0
───────────────────────────────────────────────────────────────────────────────
Total                        1        17        1         0       16          0
───────────────────────────────────────────────────────────────────────────────

The example is contrived, but it's fairly common to see this sort of thing in C++ code.

Yeah there are a heap of edge cases with this. I think that as part of the performance tweaks I am considering having bespoke parser's to speedup and catch these edge cases might be the best solution.

boyter / scc

Verbatim strings produce incorrect results #175