cederberg / grammatica

Grammatica is a C# and Java parser generator (compiler compiler)
https://grammatica.percederberg.net
Other
87 stars 35 forks source link

Regular expressions are evaluated incorrectly when an optional subexpression ends in a + or * #10

Open Stevie-O opened 5 years ago

Stevie-O commented 5 years ago

(Note: I found this in the C# version. I don't know if the bug exists in the Java version, though I expect it does.)

When a Tokenizer is initialized, all terminals defined by regular expressions are converted to NFAs for evaluation during the lexing step.

When an optional subexpression's final term has an "X-or-more" modifier (? or *), the regex is converted incorrectly, producing an NFA that does not correspond to the specified regular expression. Instead, the modifier is applied to the portion of the subexpression before the final term.

Consider the following regular expression: ((##[^\r\n]*)?\r?\n)+ This contains the optional subexpression: (##[^\r\n]*)?

This subexpression should match:

When the ParseFact parses the subexpression, it calls ParseAtom, which returns an NFA of the form:

S0:
  '#' -> S1
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Then, the '?' is read, and ParseFact calls ParseAtomModifier to rewrite the NFA. ParseAtomModifier computes min=0 max=1 and simply adds an epsilon path from the start to the end state, to make it optional. This rewrites the state machine to:

S0:
  '#' -> S1
  nil -> S2
S1:
  '#' -> S2
S2: (accept/end state)
  [^\r\n] -> S2

Because there are outgoing transitions from S2, this changes the expression to (##)?[^\r\n]*, which matches:

(As an aside, this turns my regex into ((##)?[^\r\n]*\r?\n)+, which will match any string that ends in LF as long as all CRs are immediately followed by LF; since that's going to be pretty much any real string input, the regex matched the entire input, which produced some very confusing error messages.)

The fix for this is simple: in ParseAtomModifier, before the "handle supported repeaters" comment:

if (end.outgoing.Length > 0) {
   end = end.AddOut(new NFAEpsilonTransition(end));
}

This causes the method to produce the following NFA:

S0:
  '#' -> S1
  nil -> S3
S1:
  '#' -> S2
S2:
  [^\r\n] -> S2
  nil -> S3
S3: (accept/end state)

This will correctly match:

Is the C# runtime code mechanically translated from the Java runtime code, or was that a one-time translation that's kept in sync manually now? If the latter, I can submit a patch for both, as well as patch to fix several other minor errors in the Java->C# translation code.

cederberg commented 5 years ago

Fantastic bug report! Thank you for taking the time to detail the issue at such length. Amazing!

The C# and Java runtimes are manually kept in sync, where it makes sense. But the C# version has a slightly nicer API, as it supports properties (that Java didn't at the time). If you provide a patch for either one or both, it would be great!