The problem here is in the way the regular expressions are used.
First, all the regular expressions are combined into a single lexical
tokenizer:
/(?:[A-Za-z]+(?=\d)|(?:[A-Za-z]+)|\d+)/g
which is then globally matched against the string to produce an array of tokens.
Second, the regular expressions are used to classify each individual token.
Since the individual token does not contain the lookahead characters, only the
second, less restrictive one matches: [A-Za-z]+
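The two phases described above can be reproduced outside prettify. The following is only a rough sketch: the pattern list is inferred from the combined regex quoted above, `classify` is a hypothetical helper, and the style labels are just illustrative strings. None of it is prettify's actual code.

```javascript
// Hypothetical stand-in for the tokenize-then-classify process; not prettify's code.
const patterns = [
  ['kwd', /^[A-Za-z]+(?=\d)/], // letters followed by a digit (lookahead)
  ['typ', /^[A-Za-z]+/],       // any other run of letters
  ['lit', /^\d+/]              // digits
];

// Phase 1: join every pattern source (minus its leading ^) into one global tokenizer.
const tokenizer = new RegExp(
  '(?:' + patterns.map(([, re]) => re.source.replace(/^\^/, '')).join('|') + ')',
  'g'
);

function classify(source) {
  const tokens = source.match(tokenizer) || [];
  // Phase 2: test each token in isolation against the individual patterns.
  return tokens.map(token => {
    const hit = patterns.find(([, re]) => re.test(token));
    return [token, hit ? hit[0] : 'pln'];
  });
}

const result = classify('abc123');
console.log(result); // [ [ 'abc', 'typ' ], [ '123', 'lit' ] ]
```

The tokenizer splits 'abc123' into 'abc' and '123' correctly, but once 'abc' stands alone there is no digit left for the lookahead to see, so it falls through to the less restrictive pattern.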
The solution is to use embedded sub-languages.
PR.registerLangHandler(
  PR.createSimpleLexer(
    [], [
      // Keywords are letters that are followed by a number
      ['lang-test-kw', /^([A-Za-z]+)\d+/],
      // Other letters designate types
      [PR.PR_TYPE, /^[A-Za-z]+/],
      // Numbers are literals, not part of the keywords
      [PR.PR_LITERAL, /^\d+/]
    ]
  ),
  ['test']
);
PR.registerLangHandler(
  PR.createSimpleLexer(
    [], [
      // Letters preceding numbers specify keywords.
      [PR.PR_KEYWORD, /^[A-Za-z]+/]
    ]
  ),
  ['test-kw']
);
The part that differs,
// Keywords are letters that are followed by a number
['lang-test-kw', /^([A-Za-z]+)\d+/],
means: find words followed by numbers, pass the word portion (capture group 1)
to the language handler for 'test-kw', and reparse the remainder with the
current ('test') handler.
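To make the delegation concrete, here is a minimal sketch of a lexer that treats 'lang-*' styles the way described above. The `handlers` table and the `lex` function are hypothetical and only mimic the behaviour; prettify's real implementation differs. The sketch assumes every pattern is anchored with ^ and that group 1 of a 'lang-*' pattern is never empty.

```javascript
// Hypothetical mini-lexer illustrating 'lang-*' delegation; not prettify's implementation.
const handlers = {
  'test': [
    ['lang-test-kw', /^([A-Za-z]+)\d+/],
    ['typ', /^[A-Za-z]+/],
    ['lit', /^\d+/]
  ],
  'test-kw': [
    ['kwd', /^[A-Za-z]+/]
  ]
};

function lex(lang, source) {
  const out = [];
  let rest = source;
  while (rest.length > 0) {
    let matched = false;
    for (const [style, re] of handlers[lang]) {
      const m = re.exec(rest);
      if (!m) continue;
      matched = true;
      if (style.startsWith('lang-')) {
        // Split the match around group 1: the embedded part goes to the
        // sub-language handler, the text around it back to the current one.
        const start = m[0].indexOf(m[1]);
        const before = m[0].slice(0, start);
        const after = m[0].slice(start + m[1].length);
        out.push(...lex(lang, before));
        out.push(...lex(style.slice('lang-'.length), m[1]));
        out.push(...lex(lang, after));
      } else {
        out.push([m[0], style]);
      }
      rest = rest.slice(m[0].length);
      break;
    }
    if (!matched) {
      // Nothing matched: emit one plain character and move on.
      out.push([rest[0], 'pln']);
      rest = rest.slice(1);
    }
  }
  return out;
}

console.log(lex('test', 'abc123')); // [ [ 'abc', 'kwd' ], [ '123', 'lit' ] ]
console.log(lex('test', 'def'));    // [ [ 'def', 'typ' ] ]
```

Because the digits are consumed as part of the 'lang-test-kw' match rather than through a lookahead, the decision is made while the digits are still visible, and only then is the word handed to 'test-kw'.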
This mechanism was originally defined to allow recursive processing of embedded
content, as in the following simplified case from the HTML grammar:
['lang-css', /^<style[^>]*>(.*?)<\/style>/]
The style content is pulled out of the tags and processed as CSS, and the tags
are reprocessed using the HTML grammar, which highlights their attributes
appropriately.
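As a quick illustration of what that pattern captures, the snippet below runs the simplified regex against a made-up single-line input. The sample string and variable names are assumptions for the example. Note the escaped slash in the closing tag, which a JavaScript regex literal requires, and that `.` does not match newlines, so this simplified form only handles single-line style blocks.

```javascript
// Simplified pattern from the HTML grammar example; the sample input is made up.
const styleBlock = /^<style[^>]*>(.*?)<\/style>/;
const sample = '<style type="text/css">p { color: red }</style>';

const match = styleBlock.exec(sample);
const embedded = match[1];                       // would go to the CSS handler
const start = match[0].indexOf(embedded);
const openTag = match[0].slice(0, start);        // would be reparsed as HTML
const closeTag = match[0].slice(start + embedded.length);

console.log(embedded); // p { color: red }
console.log(openTag);  // <style type="text/css">
console.log(closeTag); // </style>
```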
Original comment by mikesamuel@gmail.com on 30 Mar 2012 at 4:35
Original issue reported on code.google.com by kennedyri on 31 Jan 2012 at 11:34