Closed tlemo closed 1 year ago
Why not use [^a]+/a
because that consumes at least one character? Otherwise the scanner will get stuck in an infinite loop. Perhaps your examples is from something that requires this construct? If so, what is the actual use case?
The warning happens because [^a]*
also matches nothing, so the trailing a
overlaps with the other rule that starts with a
.
The warning happens because [^a]* also matches nothing, so the trailing a overlaps with the other rule that starts with a.
Correct, and this is intentional. For the input "a", I'd expect to see a match of the first rule with an empty match, then the second rule with "a". From a FSM perspective, this should be perfectly valid, right?
I understand that the potentially-match-nothing rules can be an unpleasant special case, but when combined with scanner states they have legitimate uses IMO. Consider a delimiter situation (ex. a string literal):
\" { push_state(STRING); }
<STRING>[^"]*/\" { return kStringToken; }
<STRING>\" { pop_state(); }
This could be a workaround for the limitation described in #190, and as far as I can tell, there's nothing fundamentally wrong with it.
Correct, and this is intentional. For the input "a", I'd expect to see a match of the first rule with an empty match, then the second rule with "a". From a FSM perspective, this should be perfectly valid, right?
Nope, that is not the correct interpretation. Both rules are applicable when the input is an "a". An FSM is not a machine that fires two rules. It can only fire one rule at a time. A DFA just finds a pattern match belonging to a rule.
In your string lexer example, why require a trailing " ? The only thing that can happen when there is a missing " is the EOF. You can also pop the state before returning. When combining these observations, the proper way IMO to implement this is something like the following:
\" { push_state(STRING); }
<STRING>[^"]*\" { pop_state(); return kStringToken; }
<STRING><<EOF>> { ... report error non-terminating " ... }
I stand corrected, the e-transition would not work with a DFA. So the reflex warning is valid here.
When combining these observations, the proper way IMO to implement this is something like the following: ...
I don't think that would work as is, since the \"
would re-enter the STRING
state. One option might be to introduce a second state dedicated to closing the quoted string, but I was hoping to avoid that.
At any rate, since this is not a reflex problem I'll close the issue. Thanks again for your help.
When combining these observations, the proper way IMO to implement this is something like the following: ...
I don't think that would work as is, since the
\"
would re-enter theSTRING
state. One option might be to introduce a second state dedicated to closing the quoted string, but I was hoping to avoid that.
Sure it works: the terminating "
goes from the STRING state back to the INITIAL state to continue scanning. When the next "
is scanned in the INITIAL state you go to the STRING state.
Sure it works: the terminating " goes from the STRING state back to the INITIAL state to continue scanning. When the next " is scanned in the INITIAL state you go to the STRING state.
Right, if you consume the terminating \"
, but not if you use it a as a lookahead. Consuming it works (that's what I actually do), except for the problem described in #190, namely the fact that some of the tokens need special truncation. The lookahead was one idea to avoid the need for truncation.
Lexer specification:
reflex
generates a seemingly incorrect warning:warning: rule cannot be matched because a previous rule subsumes it, perhaps try to move this rule up?
The first rule should not subsume the second one, since the lookahead should not be consumed, right?
Generating a lexer demonstrates incorrect behavior:
PS. see also #156