Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
504 stars 85 forks source link

Trailing context breaks other rules? #156

Closed tlemo closed 1 year ago

tlemo commented 1 year ago

I've hit an unexpected behavior, where the use of trailing context in one rule affects the matching of a different rule:

%%

xyx       out() << "A(" << text() << ")\n";

xy/[^y]   out() << "B(" << text() << ")\n";

.         out() << ".(" << text() << ")\n";

%%
  1. Build the repro:

    reflex reflex_bug.l --main
    g++ lex.yy.cpp libreflex.a
  2. Run the scanner with xyx as input

    ./a.out <<< xyx
    A(xy)
    .(x)

Notice that the first pattern (A) was matched with an incorrect lexeme (xy). Any ideas?

genivia-inc commented 1 year ago

Not sure yet what is happening here. Will dig into this soon.

genivia-inc commented 1 year ago

I should add for clarification that in this example pattern B matches as expected, because it is the longest pattern with the trailing context included. But it should trigger the B rule to display B(xy), not the A rule of course. So the rule index corresponding to the pattern isn't correct.

tlemo commented 1 year ago

I should add for clarification that in this example pattern B matches as expected, because it is the longest pattern with the trailing context included. But it should trigger the B rule to display B(xy), not the A rule of course. So the rule index corresponding to the pattern isn't correct.

For input = xyx, even if the trailing context is included, the match length for both A and B is 3. In that case, isn't the tie resolved in the favor of the first rule (in the definition order)?

genivia-inc commented 1 year ago

Uh, yes... I wasn't paying enough attention to the actual example since this problem really surprised me (thinking ahead what the problem could be). I'm sure it is not too difficult to fix. There is a misplacement in one of the DFA states.

tlemo commented 1 year ago

Great, thanks for looking into it!

genivia-inc commented 1 year ago

I will have to revisit this issue next week as I have no time due to other pressing obligations. I'm very curious what the problem is. This issue shouldn't have happened in the first place. I recall implementing this carefully and testing it.

genivia-inc commented 1 year ago

Finally got to start working on this. A fix should be included in the upcoming release 3.2.12.

genivia-inc commented 1 year ago

The issue is fixed and I will release an update soon.

It turned out to be a super simple fix, because the original RE/flex logic worked. A code change later removed a condition that was critical. Alas, regression tests did not catch that problematic change. So I've added more regression tests to avoid this issue in the future.

tlemo commented 1 year ago

Great, thanks! As much as I'd rather do without trailing contexts, I have a case where they come in handy.