Higlighting issue with strings & comments

apfelbox commented 10 years ago

This is the combined issue for all related issues (#187, #183, #155, #153).

There is a fight between the highlighting of strings vs. comments. Currently only one of them can work

// this is "a string"
"string with /* comment";

I will take a look whether this issue can be "fixed" (it can't be fixed for all cases due to the limited power of regular expressions though).

aviaryan commented 10 years ago

I have used SyntaxHighlighter in the past which doesn't have this problem. It would be a good idea to look at its core.

aviaryan commented 10 years ago

Maybe this can be solved by using lookbehind and then looking for even number of quotes (") before the comment character. I am trying this for AutoHotkey, doesn't work currently..

'comment': {
        pattern: /(("[^"]*"[^"]*)*)(\s;[^\n]*|^\s*\/\*[\s\S]*\s*\*\/)/gm,
        lookbehind: true
    },

aviaryan commented 10 years ago

It works now :-)

(^[^"]*("[^"]*"[^"]*)*)

("[^"]*"[^"]*)* was not sufficient as it easily matched 0 matches because of the last * . Adding the extra, independent ^[^"]* forces it to match something from the start of the string and thus fail in case even number of quotes (") are not available.

apfelbox commented 10 years ago

This was my first thought, but I didn't have any chance to look at it yet.

How SyntaxHighlighter fixes it is by simple checking for nested matches and removing these: https://github.com/alexgorbatchev/syntaxhighlighter/blob/master/src/js/shCore.js#L1325

apfelbox commented 10 years ago

You should format your comment as code, otherwise it is not really readable :wink: (^[^"]*("[^"]*"[^"]*)*)

my only problem is that this regex is pretty slow.

LeaVerou commented 10 years ago

Yeah, I’d rather not slow down the general case to work around an edge case either…

apfelbox commented 10 years ago

Ok, so we basically have two issues:

performance impact of highly quantified regex ((x*x*)* is pretty much the negative example of Catastrophic backtracking)
The global check for " or ' doesn't work because we only match parts of the input and not the whole text everytime. Therefore we can't reliably count the number of quotes.

1. Performance

Let's take a look at some numbers.

The current regex for clike goes like this: /(^|[^\\])(\/\*[\w\W]*?\*\/|(^|[^:])\/\/.*?(\r?\n|$))/g

It does some pretty basic lookbehind testing to ensure that we match only if there is no \ right before the comment.

If we skip this requirement (to keep it simple and it doesn't cause a performance hit) we end up with: /(^[^"]*("[^"]*"[^"]*)*[^"\\]*)(\/\*[\w\W]*?\*\/|(^|[^:])\/\/.*?(\r?\n|$))/g

I will take the canonical example of a failing test case - the grunt config:

grunt.initConfig({
    sass: {
        dist: {
            files: [{
                test: 5, //a
                src: "scss/**/*.scss"
            }]
        }
    }
});

So, the current issue ("known failure") is that /**/ is matched as comment inside the string. So we can fix the issue with the improved regex above, but let's take a look at some numbers:

The first match is the first comment (//a, should match), the second match the "comment" inside the string ("../**/...", should not match).

the improved version: http://regex101.com/r/hP3rA0/1 MATCH 1 - FINISHED IN 8445 STEPS (match :white_check_mark:) MATCH 2 - FINISHED IN 3 STEPS (no match :white_check_mark:) Total: 8448 Steps (+ 614%)
the current version: http://regex101.com/r/oZ4iV2/1 MATCH 1 - FINISHED IN 895 STEPS (match :white_check_mark:) MATCH 2 - FINISHED IN 287 STEPS (match :heavy_exclamation_mark:) Total: 1182 Steps (+/- 0%)

That's a pretty heavy performance impact - and what is pretty ugly: it depends on the length of the input (so it might get even worse). Maybe there is a regex guru out there who optimizes the hell out of the regex - please step forward, we would all benefit from it. :wink:

2. Partial matches with global context

Well, this is the main problem. There is no global context available. You could - however - extend the logic so that still only the small part is matched, but there is a "global lookbehind" (= the input up to the currently matched part). This would be doable (I guess), although it doesn't strike me as very elegant.

3. Solution?

Well, I will try to look at the way SyntaxHighlighter has implemented it (by transforming the content in a "sort-of-AST" and optimizing it). I have the hopes up that this could be implementable without much effort and complication. But we will see when the prototype stands.

edit: typo in new regex fixed edit 2: relative performance calculated

aviaryan commented 10 years ago

Well, I will try to look at the way SyntaxHighlighter has implemented it (by transforming the content in a "sort-of-AST" and optimizing it). I have the hopes up that this could be implementable without much effort and complication.

That would be the best. Currently I feel that PrismJS locks the string which is highlighted once and then prevents it from matching successive tokens. Thus results are not what you expect.

A problem that I am stuck with in the last few days is purely because of this fact. I am in need to highlight the first word of the first line of code. highlight not_highlight no no no I can't find a way to highlight only the highlight word and leave the rest untouched. Using ^ to match the start of code matches not_highlight once highlight has been highlighted and this process continues. BTW, is there a google group/forum to discuss such things..

aviaryan commented 10 years ago

That's a pretty heavy performance impact - and what is pretty ugly: it depends on the length of the input (so it might get even worse).

I use a large 1600 lined code for testing my prism definition for autohotkey language. Before implementing the fix for this issue, it took about 4 seconds to load on my computer and now too takes almost 4 seconds. I know this is not precise, and so I tried to use regex101.com to measure the performance impact but I can't find where they display number of "STEPS".

apfelbox commented 10 years ago

Regarding your last question: in the left sidebar you need to click on Tools > regex debugger.

apfelbox commented 10 years ago

Ok, update. After sleeping a night over it I realized, that making all the quantified selectors in the lookbehind non-greedy could speed up the regex a lot, because we dramatically reduce the branching factor of the regex matching.

# old improved version
/(^[^"]*("[^"]*"[^"]*)*[^"\\]*)(\/\*[\w\W]*?\*\/|(^|[^:])\/\/.*?(\r?\n|$))/g

# new improved version
/(^[^"]*?("[^"]*?"[^"]*?)*?[^"\\]*?)(\/\*[\w\W]*?\*\/|(^|[^:])\/\/.*?(\r?\n|$))/g

improved & optimized solution: http://regex101.com/r/yS6dY5/1 MATCH 1 - FINISHED IN 660 Steps (match :white_check_mark:) MATCH 2 - FINISHED IN 3 STEPS (no match :white_check_mark:) Total: 663 Steps (- 44%)

While this looks promising we need to be very careful not to break existing language definitions.

PS: if this works out we can try to profile and possibly optimize regex'es of other existing language definitions.

apfelbox commented 10 years ago

PS2: we still need to fix the global lookbehind issue.

aviaryan commented 10 years ago

Thanks @apfelbox i tried the old and new pattern (see pull request) on this code

msgbox % ";str" " /*str"" " ;comment
msgbox % "Loading ;str" "" ;comment

Old - (38|65|4) - http://regex101.com/r/nY6iY0/1 New - (87|85|3) - http://regex101.com/r/sZ2yU2/3 New when compared to Old - 163%

Updated New - (76|73|3) - http://regex101.com/r/sZ2yU2/4 - 142% Updated New - (51|49|3) - http://regex101.com/r/sZ2yU2/5 - 97% :+1: :100:

aviaryan commented 10 years ago

PS2: we still need to fix the global lookbehind issue.

What is it ?

apfelbox commented 10 years ago

See 2. in https://github.com/LeaVerou/prism/issues/307#issuecomment-50788976

aviaryan commented 10 years ago

I have used the m modifier combined with anchor ^ and in AHK and this helps in not matching the whole input everytime. As I see, you don't use m in clike. Does C syntax doesn't allow that ? Update You should also try using [^"\n] in place of [^"] to prevent taking additional lines into account. Update [^something]* is pure disaster, it matches everything including \n

apfelbox commented 10 years ago

I know that, but there are several problems:

the current architecture of prism only matches parts of the string. So the part on which the regex matches is not always the complete input up to this point. You just can't reliably count the number of quotes up to this point then.
[^"\n] will fail for multiline strings.

"blabla
blabla // test"

Your regex won't match that.

aviaryan commented 10 years ago

[^"\n] will fail for multiline strings.

Oh sorry. Multi-line strings are not there in AutoHotkey. I should have appended "in case C syntax allows that".

the current architecture of prism only matches parts of the string. So the part on which the regex matches is not always the complete input up to this point. You just can't reliably count the number of quotes up to this point then.

What if I keep the comments regex right at the top.Then the whole of haystack will be available as input.

apfelbox commented 10 years ago

Oh sorry. Multi-line strings are not there in AutoHotkey. I should have appended "in case C syntax allows that".

There are languages which extend clike that don't allow them (javascript for example), but for now we just need to fix it for all clike cases.

If your language just allows single linge comments it is perfectly fine to do it with the regex - there won't be any problems and by anchoring the regex in the line with ^ it will be pretty fast.

What if I keep the comments regex right at the top.Then the whole of haystack will be available as input.

This is true. I will evaluate all possible solutions in the coming time, so maybe I can tell a bit more about it in the future.

zeitgeist87 commented 10 years ago

This is just a joke, so don't take it too seriously :). But it does actually work and it solves most of the above problems. I don't know how good the performance is though. I use a fake RegExp object and do some extra checking. It handles single quotes, regexes, double quotes and escaped quotes. I don't think your proposed solutions can do that.

Prism.languages.clike = {
    'comment':{
        pattern: {
            pattern: /(?:\/\/.*|\/\*[\w\W]*?\*\/)/g,
            exec: function(str) {
                var p = this.pattern,
                    match;

                p.lastIndex = 0;
                match = p.exec(str);
                while (match) {
                    var inDoubleQuote = false,
                        inSingleQuote = false,
                        inRegex = false,
                        isEscaped = false;

                    for (var i = 0; i < match.index; ++i) {
                        switch(str[i]) {
                            case '"':
                                inDoubleQuote = (!inDoubleQuote || isEscaped) &&
                                                !inSingleQuote &&
                                                !inRegex;
                                break;
                            case '\'':
                                inSingleQuote = !inDoubleQuote &&
                                                (!inSingleQuote || isEscaped) &&
                                                !inRegex;
                                break;
                            case '/':
                                inRegex = !inDoubleQuote &&
                                            !inSingleQuote &&
                                            (!inRegex || isEscaped);
                                break;
                            case '\n':
                                inRegex = false;
                                break;
                            case '\\':
                                isEscaped = !isEscaped;
                                break;
                            default:
                                isEscaped = false;
                        }
                    }

                    if (!inDoubleQuote && !inSingleQuote && !inRegex)
                        return match;

                    p.lastIndex = match.index + 1;
                    match = p.exec(str);
                }
                return match;
            }
        }
    },
...

LeaVerou commented 10 years ago

That looks like a parser of sorts. Of course it would be correct, but I doubt the performance will be decent, especially for larger snippets.

foo123 commented 8 years ago

Check prism-grammar external parsing add-on for Prism, which uses a grammar to highlight solving all these issues (and more)

WebReflection commented 8 years ago

chiming in with a proof of concept that I've no idea if is worth exploring.

Basically, it removes strings, highlight the content, then it puts strings back.

function safeHighlight(content, highgligh) {
  var
    uniqueID = '_@no\x01shenanigans@_',
    restore = new RegExp('(["\'])' + uniqueID + '(.+?)\\1', 'g'),
    str = []
  ;
  function pre(content) {
    return content.replace(
      /(["'])(?:(?=(\\?))\2.)*?\1/g,
      ($0, $1) => $1 + uniqueID + (str.push($0.slice(1, -1)) - 1) + $1
    );
  }
  function post(content) {
    return content.replace(
      restore,
      ($0, $1, $2) => $1 + str[$2] + $1
    );
  }
  return post(highgligh(pre(content)));
}

var content = 'var str = "hello/**/world//!";';

// use prism or the actual highlighter instead of String
safeHighlight(content, String);

I haven't done many benchmarks but it seems like a simple/safe solution for strings in comments and/or in code: it will put them back in their context, if that was highlighted, strings will result highlighted too.

Best Regards

[edit] it might be worth using a unique token to avoid conflicts with generated html so that any <span atribute="23"> won't be injected by the replacer [edit 2] I've updated the code with the better placeholder

zeitgeist87 commented 8 years ago

@WebReflection your solution would certainly work for strings, although it seems more like a hack. Do you have any proposal for turning it into a general purpose solution?

We have recently merged a partial solution for that problem #779. It can correctly handle strings with one comment inside but not more (e.g.: "hello/**/world!")

WebReflection commented 8 years ago

well, the general purpose solution is what you call the hack:

find all strings and put a string placeholder
parse everything without problems and ignoring all strings content
put strings content back using the placeholder

A char-by-char parser wouldn't have this problem because it would know it's inside a string ... my work-around does exactly the same making the "inside a string" a no-brainer for the parser.

I could write a more robust example with a better placeholder but I honestly don't know how to improve every language parsed by prism

[edit] this would be my best pick

var safeHighlight = (function () {
  var
    uid = '_@no\x01shenanigans@_',
    find = /(["'])(?:(?=(\\?))\2.)*?\1/g,
    restore = new RegExp('(["\'])' + uid + '(\\d+)\\1', 'g'),
    rp1 = function ($0, $1) {
      return $1 + uid + (str.push($0.slice(1, -1)) - 1) + $1;
    },
    rp2 = function ($0, $1, $2) {
      return $1 + str[$2] + $1;
    },
    str
  ;
  return function safeHighlight(content, highgligh) {
    str = [];
    console.log(restore);
    var out = highgligh(content.replace(find, rp1)).replace(restore, rp2);
    str = [];
    return out;
  };
}());

zeitgeist87 commented 8 years ago

well, the general purpose solution is what you call the hack:

find all strings and put a string placeholder parse everything without problems and ignoring all strings content put strings content back using the placeholder

I understand how it works and it is a nice solution. With general purpose solution I mean something that works for any grammar and not only for strings but also for other stuff. This problem is not limited to strings. For example you can have a regex with strings inside /something"foo"something"bar"/

WebReflection commented 8 years ago

For example you can have a regex with strings inside /something"foo"something"bar"/

not an issue, regexp will be highlighted as such, and strings will be put back. So this: /something"foo"something"bar"/ will become /something"_@no\x01shenanigans@_"something"_@no\x01shenanigans@_"/ and then it will become again the original string.

TL;DR it should work universally but the order obviously matters.

literal RegExp don't suffer the code bug because both multiline and single line needs to be escaped, so comments first
regular expressions would be second and the same trick could be done for literal RegExp too ... you remove strings first, then literal RegExp, then you parse
string parsing at the end because it's now safer, easier

This would be my order/solution ... but yet it's crafted for JS comments and JS strings.

WebReflection commented 8 years ago

to better explain:

pre process dropping strings
process processed outcome finding comments, literal regexp, rest of the code, strings
put all strings back in place so those within comments will be comments, those within regexp will be regexp, those within strings will be strings

If needed, drop strings entirely (remove $1 too and return just a placeholder) and put them back at the end using a restore callback that put back them highlighted.

Does this make sense?

foo123 commented 8 years ago

Since i happen to get notified in this discussion, let me enter my 2 cents

Yes using placeholders partialy solves the problem, so in this sense it is a hack, a nice hack though.

For example multiply escaped strings, inside strings e.g "a string with \\\\\\"escaped string\\\\\\\\\\" inside" (used in javascript) will fail, and so on.. A major problem is escaping, which needs char-by-char parsing (at least in such cases) since regexps cannot count parentheses, escapes and so on..

Then other combinations, e.g strings inside comments, or vice-versa, or comments inside regexps and so on.. Or tags inside CDATA and so on..

So this is not a good solution, only partial and even this in some cases.

i still suggest using a special (delimited) tokenizer which handles only delimited tokens (e.g comments, strings, regexps, cdata and what not, which are all over the place). it is not difficult, it is not slower (in fact it will be faster) and is actually intuitive to define in a specification e.g

"comment" : ["/*","*/"],
"string": ["'","'","\\"],
"regexp": ["/","/","\\"]
// and so on..

Cheers

zeitgeist87 commented 8 years ago

i still suggest using a special (delimited) tokenizer which handles only delimited tokens (e.g comments, strings, regexps, cdata and what not, which are all over the place). it is not difficult, it is not slower (in fact it will be faster) and is actually intuitive to define in a specification e.g

I've already implemented exactly that some time ago, but the code size got too big. The problem is that prism-core.js has to stay small and light-weight, because that is one of the major features of Prism.

zeitgeist87 commented 8 years ago

Does this make sense?

Well it seems to get quite complicated. The constraints for performance and code size are quite challenging here. We already had a few proposed solutions that didn't cut the mustard. Nevertheless I am happy to test and review any proof of concept implementations...

WebReflection commented 8 years ago

@foo123 your example is broken, while this string doesn't create a SyntaxError: "a string with \\\\\\\"escaped string\\\\\\\\\\\" inside"

Moreover, that's not an issue at all with the RegExp I've used.

var code = 'var s = "a string with \\\\\\\"escaped string\\\\\\\\\\\" inside";';
code.replace(/(["'])(?:(?=(\\?))\2.)*?\1/g, '$1hello$1');
// "var s = "hello";"

So I'm not sure what are your concerns about correctly escaped strings.

For non correctly escaped strings though, well ... that's a "who cares" case because the code is broken and cannot be possibly highlighted correctly, right?

foo123 commented 8 years ago

The fact remains that regexps cannot count (e.g number of parentheses, or number of escapes and so on..) So if they cannot count, all delimited tokens that can be escaped , meaning they can contain their own end-pattern if it is escaped in a perfectly valid way, cannot be parsed correctly with regexps. Maybe i phrased it better now.

WebReflection commented 8 years ago

I agree char-by-char is a rock-solid solution but still I'd like to underline the RegExp I'm using works with escaped strings, no matter how many escapes, because the RegExp I'm using does count escapes.

It's not a universal solution, I agree, but it works with strings and with literal RegExp.

Here an improved example that make the intermediate content easily parsable via prism

var safeHighlight = (function (Prism) {
  // v3 without shenanigans
  var
    html = {
      chars: {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        "'": '&#39;',
        '"': '&quot;'
      },
      escape: function (content) {
        return content.replace(html.find, html.replace);
      },
      find: /[&<>'"]/g,
      replace: function ($0) {
        return html.chars[$0];
      }
    },
    uid = (Math.random() + '').slice(2),
    uidFinder = uidFinder = new RegExp(
      '(?:<[^>]+?>)?' +
        uid + '\\.(\\d+)' +
      '(?:<\\/[^>]+?>)?', 'g'
    ),
    uidReplacer = function ($0, $1) {
      return html.escape(chunks[$1]);
    },
    commonStringRegExp =
      /(["'])(?:(?=(\\?))\2.)*?\1/g,
    commonReplacer = function ($0, $1) {
      var len = $1.length;
      return $1 + uid + '.' + (
        chunks.push($0.slice(len, -len)) - 1
      ) + $1;
    },
    languageWrap = {
      javascript: {
        find: [
          commonStringRegExp,
          /(\/)[^/*](?:(?=(\\?))\2.)*\1([^/=*]|$)/g
        ],
        replace: [
          commonReplacer,
          function ($0, $1, $2, $3) {
            var len = $1.length;
            return $1 + uid + '.' + (
              chunks.push($0.slice(len, -(len + $3.length))) - 1
            ) + $1 + $3;
          }
        ]
      },
      python: {
        find: [commonStringRegExp],
        replace: [commonReplacer]
      },
      ruby: {
        find: [commonStringRegExp],
        replace: [commonReplacer]
      }
    },
    langMap = new Map(
      Object.keys(Prism.languages).map(function (key) {
        return [Prism.languages[key], key];
      })
    ),
    chunks
  ;

  // some language has a shortcut equivalent.
  languageWrap.js = languageWrap.javascript;
  languageWrap.py = languageWrap.python;

  return function safeHighlight(content, language) {
    var
      languageName = langMap.get(language) || '',
      wrap = languageWrap.hasOwnProperty(languageName) ?
        languageWrap[languageName] : null,
      out = content,
      i = 0
    ;
    chunks = [];
    if (wrap) {
      while (i < wrap.find.length) {
        out = out.replace(wrap.find[i], wrap.replace[i]);
        i++;
      }
    }
    out = Prism.highlight(out, language);
    if (wrap) {
      while (i--) out = out.replace(uidFinder, uidReplacer);
    }
    chunks = [];
    return out;
  };
}(Prism));

var code = `
  /* this is a multiline comment */
  var s = "a string with \\\\\\\"escaped string\\\\\\\\\\\" inside";
  var c = "a string with /*ml comment*/ and a //sl too";
  // this is a single line comment
  var re = /this is a re" with a string "inside/g;
  var rec = /this is a re with a [^/*ml*/] inside/g;
`;

safeHighlight(code, Prism.languages.js);

How does the intermediate code look like?

/* this is a multiline comment */
var s = "123123882183.0";
var c = "123123882183.1";
// this is a single line comment
var re = /123123882183.2/g;
var rec = /123123882183.3/g;

Is that middle-content easy to parse for prism? I think so.

here a jsbin

foo123 commented 8 years ago

Yes having the regexp take account of any preceding escapes does fix the thing, however i;m not sure what would happen with multiple escaped strings inside a single string, does it work the same? It should be tested. true not usual case, but lets do it for science :)

Btw even ace editor (if i'm not mistaken) which github uses to highlight code has a hard time with these things even though it uses a state machine for parsing tokens (ace javascript mode)

WebReflection commented 8 years ago

dunno, what I think is good, is that I've created a jsbin for this case.

At least, if nothing, there is a playground to test nasty code and see how to proceed.

There are surely edge cases, but since edge cases are a "won't fix" here, due nature of the parser which is RegExp based and, as such, cannot be perfect, I guess at least reducing them would be already a good achievement.

Hopefully worth exploring.

zeitgeist87 commented 8 years ago

@WebReflection Your expanded solution works as a proof of concept, but you have to fully integrate it into prism-core.js, which I believe is possible and could work. However I am still skeptical if it would be any better than the solution we already have. The input for your function must be an arbitrary Prism grammar for an arbitrary language. It shouldn't be limited to JavaScript.

Also the string inside of a regex example var re = /this is a re" with a string "inside/g; doesn't work, because placeRe encodes the " characters to ". Therefore they cannot be found any more by the restoreStr regex.

There are surely edge cases, but since edge cases are a "won't fix" here, due nature of the parser which is RegExp based and, as such, cannot be perfect, I guess at least reducing them would be already a good achievement.

We have recently merged a general purpose, small solution with minimal performance impact, which reduces these edge cases significantly #779.

WebReflection commented 8 years ago

nice catch about the regExp, it's because I've adde the HTML escaper without retesting (stupid me).

I understand it should work as general purpose solution but I don't think it will be accepted here (because of skepticism) so I won't bother with a PR but I'll fix my proof of concept.

If there is real interest in having it in I'll try to improve the proof of concept and submit a PR with the feature in.

WebReflection commented 8 years ago

FYI I've fixed the issue with strings in regexps (with strings in general) https://jsbin.com/qazida / @zeitgeist87

LeaVerou commented 8 years ago

The problem with removing strings and using placeholders is that it doesn't handle all other kinds of nested tokens, and if we handle those too we're basically parsing the input, so we may as well do some proper lightweight parsing first. I wonder how much that would slow us down, actually. We could have a special kind of token that is nested (has a start and an end delimiter), then go over the input once and figure out what the groups are. Done right it might be a workable solution, perhaps.

WebReflection commented 8 years ago

I'm not sure you should handle any nested token within single or double quoted strings but I've improved the example with better code, Python and Ruby compatible too (easy to make it safer with others) and improved a bit in terms of API https://github.com/PrismJS/prism/issues/307#issuecomment-205453544

The previous version was suffering nested tokens indeed, now in the worst case scenario you gonna have a number highlight instead of string.

Still not perfect but we all know with RegExp only it's hard to be so :-)

Golmote commented 8 years ago

This is mostly fixed by the greedy feature (#779) and its improvement (#967). I'm closing the issue.

ziagit commented 4 years ago

Hi there, I'm using Prism with Laravel, The problem I face; when I have something like {{myText}} in my HTML file, this string interpolation cause, not to show the whole page,

please help thanks

PrismJS / prism