Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or *

(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)

When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.

When an optional subexpression's final term has an "X-or-more" modifier (? or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.

Consider the following regular expression: ((##[^\r\n]*)?\r?\n)+ This contains the optional subexpression: (##[^\r\n]*)?

This subexpression should match:

The empty string
Two consecutive # symbols followed by zero or more characters that are not CR or LF.

When the ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:

S0:
  '#' -> S1
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA. ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:

S0:
  '#' -> S1
  nil -> S2
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:

The empty string
Any number of non-CR/LF characters

(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)

The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment:

if (end.outgoing.Length > 0) {
   end = end.AddOut(new NFAEpsilonTransition(end));
}

This causes the method to produce the following NFA:

S0:
  '#' -> S1
  nil -> S3
S1:
  '#' -> S2
S2:
  [^\r\n] -> S2
  nil -> S3
S3: (accept/end state)

This will correctly match:

The empty string
Two consecutive '#' characters followed by any number of non-CR/LF characters.

Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as patch to fix several other minor errors in the Java->C# translation code.

cederberg / grammatica

Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or * #10