Regexes with automatic ^

Hello,

Regex rules are not anchored to the beginning of the line. They can be (and are meant to be) used side by side. I checked your case, and what is happening there is something different.

The tokenizer always tries to match the longest text that matches any rule. This is important because when it encounters a string such as "classic", we wouldn't want it to be interpreted as "class" followed by "ic" if a lexer rule "class" is also present. When an input string of the form someToken1234567890abcd is parsed, the tokenizer successfully tokenizes the first 14 characters, which match your rule someOtherToken. The resulting token is "someToken12345". But then it fails, because the remaining characters "67890abcd" do not match any rule in your grammar.

Unfortunately, there is not much you can do here, as your grammar allows undelimited strings everywhere. If your someOtherToken were defined as ~\w{8}~, the parser would split the string someToken12345678 into "someToken" and "12345678". This is because "someToken", at 9 characters, is longer than the 8-character regex match at the same position and so would be matched first; the remainder would then match the regex rule.
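
To make the longest-match behaviour concrete, here is a rough Python sketch. It is not the library's actual code, only an illustration of the idea. The function name tokenize, the rule lists, and the assumption that someOtherToken is ~\w{14}~ in your grammar (inferred from the 14-character match) are all mine, as is the tie-breaking choice that the earlier rule wins when two matches have equal length.

```python
import re

def tokenize(text, rules):
    """Illustrative longest-match tokenizer (hypothetical, not the library's code).

    At each position every rule is tried, and the longest match wins.
    On a tie, the rule listed first is kept (an assumption of this sketch).
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            # match() anchors at pos, the current position, not at the line start
            m = pattern.match(text, pos)
            if m and len(m.group(0)) > len(best_text):
                best_name, best_text = name, m.group(0)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos:]!r}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# Assumed shape of the grammar from the issue: a literal plus a \w{14} regex.
RULES_14 = [
    ("someToken", re.compile(r"someToken")),
    ("someOtherToken", re.compile(r"\w{14}")),
]

# The alternative discussed above, with the regex shortened to \w{8}.
RULES_8 = [
    ("someToken", re.compile(r"someToken")),
    ("someOtherToken", re.compile(r"\w{8}")),
]

if __name__ == "__main__":
    # The 9-character literal beats the 8-character regex match at position 0.
    print(tokenize("someToken12345678", RULES_8))

    # Here \w{14} wins at position 0, and "67890abcd" is left with no matching rule.
    try:
        tokenize("someToken1234567890abcd", RULES_14)
    except ValueError as err:
        print("failed:", err)
```

With RULES_8 the first call yields "someToken" followed by "12345678". With RULES_14 the second call consumes "someToken12345" and then raises at position 14, which mirrors the failure you are seeing.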

What are you trying to parse? Maybe I can help.

egemadra / recursive-descent
