Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
504 stars 85 forks source link

Trailing context (lookahead) problems #189

Closed tlemo closed 11 months ago

tlemo commented 1 year ago

Lexer specification:

%%

[^a]*/a   out() << "r1(" << text() << ")\n";
a         out() << "r2(" << text() << ")\n";

%%

reflex generates a seemingly incorrect warning: warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?

The first rule should not subsume the second one, since the lookahead should not be consumed, right?

Generating a lexer demonstrates incorrect behavior:

echo xxxa | ./a.out 
r1(xxx)
a <- expected r2(a)

echo a | ./a.out 
a <- expected r2(a)

PS. see also #156

genivia-inc commented 11 months ago

Why not use [^a]+/a because that consumes at least one character? Otherwise the scanner will get stuck in an infinite loop. Perhaps your examples is from something that requires this construct? If so, what is the actual use case?

The warning happens because [^a]* also matches nothing, so the trailing a overlaps with the other rule that starts with a.

tlemo commented 11 months ago

The warning happens because [^a]* also matches nothing, so the trailing a overlaps with the other rule that starts with a.

Correct, and this is intentional. For the input "a", I'd expect to see a match of the first rule with an empty match, then the second rule with "a". From a FSM perspective, this should be perfectly valid, right?

I understand that the potentially-match-nothing rules can be an unpleasant special case, but when combined with scanner states they have legitimate uses IMO. Consider a delimiter situation (ex. a string literal):

\"   { push_state(STRING); }
<STRING>[^"]*/\" { return kStringToken; }
<STRING>\" { pop_state(); }

This could be a workaround for the limitation described in #190, and as far as I can tell, there's nothing fundamentally wrong with it.

genivia-inc commented 11 months ago

Correct, and this is intentional. For the input "a", I'd expect to see a match of the first rule with an empty match, then the second rule with "a". From a FSM perspective, this should be perfectly valid, right?

Nope, that is not the correct interpretation. Both rules are applicable when the input is an "a". An FSM is not a machine that fires two rules. It can only fire one rule at a time. A DFA just finds a pattern match belonging to a rule.

In your string lexer example, why require a trailing " ? The only thing that can happen when there is a missing " is the EOF. You can also pop the state before returning. When combining these observations, the proper way IMO to implement this is something like the following:

\"   { push_state(STRING); }
<STRING>[^"]*\" { pop_state(); return kStringToken; }
<STRING><<EOF>> { ... report error non-terminating " ... }
tlemo commented 11 months ago

I stand corrected, the e-transition would not work with a DFA. So the reflex warning is valid here.

When combining these observations, the proper way IMO to implement this is something like the following: ...

I don't think that would work as is, since the \" would re-enter the STRING state. One option might be to introduce a second state dedicated to closing the quoted string, but I was hoping to avoid that.

At any rate, since this is not a reflex problem I'll close the issue. Thanks again for your help.

genivia-inc commented 11 months ago

When combining these observations, the proper way IMO to implement this is something like the following: ...

I don't think that would work as is, since the \" would re-enter the STRING state. One option might be to introduce a second state dedicated to closing the quoted string, but I was hoping to avoid that.

Sure it works: the terminating " goes from the STRING state back to the INITIAL state to continue scanning. When the next " is scanned in the INITIAL state you go to the STRING state.

tlemo commented 11 months ago

Sure it works: the terminating " goes from the STRING state back to the INITIAL state to continue scanning. When the next " is scanned in the INITIAL state you go to the STRING state.

Right, if you consume the terminating \", but not if you use it a as a lookahead. Consuming it works (that's what I actually do), except for the problem described in #190, namely the fact that some of the tokens need special truncation. The lookahead was one idea to avoid the need for truncation.