egemadra / recursive-descent

A recursive-descent parser generation tool
MIT License

Regexes with automatic ^ #1

Open anacronw opened 2 months ago

anacronw commented 2 months ago

Hello! Doesn't the documentation quoted below imply that this parser cannot use regexes for grammars whose tokens sit side by side?

Regexp: A regexp expression is delimited by '~' symbol. Right after the closing ~, you can put an 'i' to match tokens case insensitively. Such as ~[a-z]+~i Don't put ^, $ or any other anchors. ^ is automatically inserted at the beginning.

number = ~[0-9]+(.[0-9]+)?~ ;
identifier = ~[A-Za-z][A-Za-z0-9_]*~ ;

For example in the case of:

someTokensomeOtherToken

I can't define:

myTokens: someToken someOtherToken
someToken : 'someToken'
someOtherToken : ~\w{14}~

because the regex is interpreted from the beginning of the line?

Thanks

egemadra commented 2 months ago

Hello,

Regex rules are not interpreted from the beginning of the line. They can be (and are meant to be) used side by side. I checked your case, and the situation here is different.
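To illustrate what an automatically inserted ^ can mean in a tokenizer, here is a small Python sketch. It is not this tool's actual code, and the helper name `match_at` is made up for the example. The idea it shows is that the pattern is anchored at the current position in the input, not at the start of a line:

```python
import re

def match_at(pattern, text, pos):
    """Return the text matched by `pattern` starting exactly at `pos`, or None."""
    # Pattern.match(text, pos) only succeeds if the match begins at `pos`,
    # which plays the role of an implicit anchor at the tokenizer's cursor.
    m = re.compile(pattern).match(text, pos)
    return m.group(0) if m else None

text = "abc123"
first = match_at(r"[a-z]+", text, 0)            # letters at the start of the input
second = match_at(r"[0-9]+", text, len(first))  # digits directly after them
```

Here `first` is "abc" and `second` is "123", even though the digits are nowhere near the start of a line. Asking for digits at position 0 gives no match.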

The tokenizer always tries to match the longest text that matches any rule. This is important because when it encounters, for example, the string "classic", we wouldn't want it to be interpreted as "class" followed by "ic" just because a lexer rule for "class" is also present. When an input string of the form someToken1234567890abcd is parsed, the tokenizer successfully matches the first 14 characters against your rule someOtherToken, since that 14-character match is longer than the 9-character match for 'someToken'. The resulting token is "someToken12345". But then of course it fails, because the remaining characters "67890abcd" do not match any rule in your grammar.

Unfortunately there is not much you can do here, as your grammar allows undelimited strings everywhere. If your someOtherToken were defined as ~\w{8}~, the parser would parse the string someToken12345678 as "someToken" and "12345678". This is because "someToken", having 9 characters, would be the longer match and would be chosen first, and then the remainder would match the regex rule.
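The two cases above can be reproduced with a minimal longest-match tokenizer. This is a Python sketch under assumptions, not the tool's real implementation: the names `tokenize` and `make_rules` are hypothetical, and breaking ties in favour of the earlier rule is a choice made only for the example.

```python
import re

def tokenize(rules, text):
    """Longest-match tokenizer sketch. `rules` is a list of (name, compiled regex)."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, regex in rules:
            # match(text, pos) anchors the rule at the current position.
            m = regex.match(text, pos)
            # Keep the longest candidate; on a tie the earlier rule wins.
            if m and len(m.group(0)) > len(best_text):
                best_name, best_text = name, m.group(0)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos:]!r}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

def make_rules(n):
    """The grammar from this thread, with someOtherToken defined as \\w{n}."""
    return [
        ("someToken", re.compile(re.escape("someToken"))),
        ("someOtherToken", re.compile(r"\w{%d}" % n)),
    ]
```

With `make_rules(8)`, the input someToken12345678 splits into "someToken" and "12345678". With `make_rules(14)`, the input someToken1234567890abcd first yields "someToken12345" and then raises an error, because "67890abcd" matches neither rule.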

What are you trying to parse? Maybe I can help.