doddoreul / google-code-prettify

Automatically exported from code.google.com/p/google-code-prettify
Apache License 2.0

Highlighting fails if pattern ends with a lookahead assertion #187

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

Consider this handler for a language consisting of keywords, types, and 
literals:

  PR.registerLangHandler(
    PR.createSimpleLexer(
      [], [
        // Keywords are letters that are followed by a number
        [PR.PR_KEYWORD, /^[A-Za-z]+(?=\d)/],
        // Other letters designate types
        [PR.PR_TYPE, /^[A-Za-z]+/],
        // Numbers are literals, not part of the keywords
        [PR.PR_LITERAL, /^\d+/]
      ]
    ),
    ['test']
  );

Apply it to this code block:

  <pre class="prettyprint lang-test" id="test_lang">
  keyword123type
  type keyword3
  </pre>

What is the expected output?  What do you see instead?

Expected:
  '`KEYkeyword`END`LIT123`END`TYPtype`END`PLN\n' +
  '`END`TYPtype`END`PLN `END`KEYkeyword`END`LIT3`END'

Actual:
  '`TYPkeyword`END`LIT123`END`TYPtype`END`PLN\n' +
  '`END`TYPtype`END`PLN `END`TYPkeyword`END`LIT3`END'

That is, the keywords are highlighted as types. Types and literals are 
highlighted correctly, though.

What version are you using?  On what browser?

I observe this on SVN revisions 176 and 194 on Chrome 17 and earlier.

Please provide any additional information below.

When the pattern ends with a lookahead assertion, that assertion is correctly 
used to tokenize the input string in the `decorate` function:

  var tokens = sourceCode.match(tokenizer) || [];

However, tokenization yields *just* the tokens, stripped of any surrounding 
context. In the example above, the first item in the token array is 'keyword'. 
Later, each token is tested against each of the patterns to see which style 
should be applied to it:

  for (var i = 0; i < nPatterns; ++i) {
    patternParts = fallthroughStylePatterns[i];
    match = token.match(patternParts[1]);

Since the token doesn't contain the trailing context (the '1' after 'keyword'), 
it doesn't match the same pattern anymore, so it gets the wrong style.

Other uses of lookaheads in the Prettify source also accept $, so they still 
match the token even when taken out of context; they're used as a custom 
version of \b. In my example, though, that won't work. A string of letters at 
the end of the input (not followed by any numbers) should be classified as a 
type, not a keyword.
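
To illustrate the point about `$`, here is a minimal sketch. Both patterns are hypothetical and written only for this example; neither is taken from the Prettify source. The first shows why a lookahead that also accepts the end of input survives being tested against an isolated token, and the second shows why the same trick would misclassify trailing letters in my grammar.

```javascript
// Hypothetical pattern in the style described above: the lookahead accepts
// either a non-word character or the end of the input.
var boundaryKeyword = /^(?:if|else)(?=\W|$)/;

// In context, the lookahead sees the space after 'if'.
var inContext = boundaryKeyword.test('if (x)');
// Out of context, the lookahead sees the end of the isolated token.
var isolated = boundaryKeyword.test('if');
// A longer word is still rejected, which is the \b-like behaviour.
var partOfWord = boundaryKeyword.test('iffy');

// Applying the same trick to the grammar from this report over-matches:
// letters at the end of the input would be accepted as a keyword.
var keywordOrEnd = /^[A-Za-z]+(?=\d|$)/;
var trailingLetters = keywordOrEnd.test('type');

console.log(inContext, isolated, partOfWord, trailingLetters);
```

So adding `$` to the lookahead would make the keyword pattern match the isolated token, but it would also match 'type', which is exactly the case that must stay a type.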

Original issue reported on code.google.com by kennedyri on 31 Jan 2012 at 11:34

GoogleCodeExporter commented 8 years ago
The problem here is in the way the regular expressions are used.

First, all the regular expressions are combined to create a lexical 
tokenizer:

  /(?:[A-Za-z]+(?=\d)|(?:[A-Za-z]+)|\d+)/g

The tokenizer is then globally matched against the string to produce an array 
of tokens.

Second, the regular expressions are used to classify each individual token.

Since the individual token does not contain the lookahead characters, only the 
second, less restrictive one matches: [A-Za-z]+
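
The two phases can be reproduced outside the library with a small standalone model. This is a simplified sketch rather than the actual Prettify code: the `classify` and `highlight` helpers are invented for illustration, and the style names are short stand-ins for the `PR.PR_*` constants.

```javascript
// Simplified model of the two phases; not the actual Prettify source.
var patterns = [
  ['kwd', /^[A-Za-z]+(?=\d)/],  // keywords: letters followed by a digit
  ['typ', /^[A-Za-z]+/],        // types: any other letters
  ['lit', /^\d+/]               // literals: digits
];

// Phase 1: a single global regex built from all patterns, anchors removed.
var tokenizer = /(?:[A-Za-z]+(?=\d)|(?:[A-Za-z]+)|\d+)/g;

// Phase 2: classify each token in isolation; the first matching pattern wins.
function classify(token) {
  for (var i = 0; i < patterns.length; ++i) {
    if (patterns[i][1].test(token)) {
      return patterns[i][0];
    }
  }
  return 'pln';
}

function highlight(source) {
  var tokens = source.match(tokenizer) || [];
  return tokens.map(function (token) {
    return [classify(token), token];
  });
}

// The tokenizer splits the input correctly, but 'keyword' on its own has no
// trailing digit, so the lookahead fails and the token falls through to 'typ'.
var result = highlight('keyword123type');
console.log(JSON.stringify(result));
```

The token boundaries come out right because phase 1 sees the whole string; only the style assignment in phase 2 goes wrong.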

The solution is to use embedded sub-languages.

  PR.registerLangHandler(
    PR.createSimpleLexer(
      [], [
        // Keywords are letters that are followed by a number
        ['lang-test-kw', /^([A-Za-z]+)\d+/],
        // Other letters designate types
        [PR.PR_TYPE, /^[A-Za-z]+/],
        // Numbers are literals, not part of the keywords
        [PR.PR_LITERAL, /^\d+/]
      ]
    ),
    ['test']
  );

  PR.registerLangHandler(
    PR.createSimpleLexer(
      [], [
        // Letters preceding numbers specify keywords.
        [PR.PR_KEYWORD, /^[A-Za-z]+/]
      ]
    ),
    ['test-kw']
  );

The portion that differs

        // Keywords are letters that are followed by a number
        ['lang-test-kw', /^([A-Za-z]+)\d+/],

means: find words followed by numbers, pass the word portion (in group 1) to 
the language handler for 'test-kw', and reparse the remainder of the match 
(here, the digits) using the current handler.
This mechanism was originally defined to allow recursive processing of embedded 
content, as in the following simplified case from the HTML grammar:

     ['lang-css', /^<style[^>]*>(.*?)<\/style>/]

The style content is pulled out of the tags and processed as CSS, while the 
tags themselves are reprocessed using the HTML grammar, which highlights their 
attributes appropriately.
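
As a rough illustration of how a `lang-*` pattern divides a match, the sketch below splits the matched text into the part captured by group 1 and the text on either side of it. The `splitEmbedded` helper is hypothetical and only models the idea; the library locates and dispatches the embedded portion in its own way.

```javascript
// Hypothetical helper: split a match into the embedded part (group 1) and
// the surrounding text that would be reparsed by the current handler.
function splitEmbedded(text, pattern) {
  var match = text.match(pattern);
  if (!match || match[1] === undefined) {
    return null;
  }
  // Assumes the first occurrence of group 1 inside the match is the capture.
  var start = match[0].indexOf(match[1]);
  return {
    before: match[0].slice(0, start),
    embedded: match[1],
    after: match[0].slice(start + match[1].length)
  };
}

// Keyword case: 'keyword' would go to the 'test-kw' handler, and '123'
// would be reparsed by the 'test' handler as a literal.
var kw = splitEmbedded('keyword123', /^([A-Za-z]+)\d+/);

// HTML case: the CSS body is embedded, the tags stay with the HTML grammar.
var css = splitEmbedded(
  '<style type="text/css">p { color: red }</style>',
  /^<style[^>]*>(.*?)<\/style>/
);

console.log(JSON.stringify(kw));
console.log(JSON.stringify(css));
```

Because the keyword pattern consumes the digits as part of the match, they are not lost: they end up in the `after` portion and are highlighted by the outer grammar.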

Original comment by mikesamuel@gmail.com on 30 Mar 2012 at 4:35