Define constrained variants of & and ~ for better error handling

Consider the following definitions:

a & b matches the largest sequence that matches a and matches b.
a ~ b matches the largest sequence that matches a but doesn't match b.

An interesting variant of these that could be very useful for implementing better error messages is the following:

a &? b matches the largest sequence matching a, but signals an error if b doesn't match the sequence.
a ~? b matches the largest sequence matching a, but signals an error if b also matches the sequence.

Firstly, these operators are much easier to implement, since they don't require a possibly character-by-character search for a match on both sides at the same time (consider aa? ~ aa, which doesn't match aa but does match a).
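The difference in implementation effort can be sketched as follows. This is a minimal illustration, not the wbnf implementation: it uses Python's `re` module as a stand-in for the grammar terms, and the helper names (`match_lengths`, `strict_except`, `constrained_except`) are hypothetical. It also assumes that a greedy `re.match` is a good enough approximation of "largest sequence matching a".

```python
import re

def match_lengths(pattern, text):
    """All n such that text[:n] fully matches pattern (brute-force search)."""
    return [n for n in range(len(text) + 1) if re.fullmatch(pattern, text[:n])]

def strict_except(a, b, text):
    """a ~ b: length of the longest prefix that matches a but not b, or None."""
    candidates = [n for n in match_lengths(a, text)
                  if re.fullmatch(b, text[:n]) is None]
    return max(candidates) if candidates else None

def constrained_except(a, b, text):
    """a ~? b: match a once, then flag an error if b also matches that span."""
    m = re.match(a, text)  # assumption: greedy match stands in for "largest"
    if m is None:
        return None
    n = m.end()
    error = re.fullmatch(b, text[:n]) is not None
    return n, error

# aa? ~ aa on the input "aa" has to fall back to the shorter match "a".
print(strict_except("aa?", "aa", "aa"))
# aa? ~? aa simply takes the largest match of aa? and reports the conflict.
print(constrained_except("aa?", "aa", "aa"))
```

The strict form has to enumerate every candidate length for a and test b against each one, whereas the constrained form performs one match of a and one check of b.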

Secondly — and this is the primary motivation — the new variants would make it easier to define error messages without sacrificing formal correctness. As an example, let's consider the following simple string syntax with some support for escapes:

STRING -> /{ " (?: \\x[[:xdigit:]]{2} | \\[nrt"\\] | [^"\\] )* " } (n.b., spaces are ignored per #8)

The problem with this syntax is that, by being very tight (aka "correct") about what constitutes a valid string, it will reject many almost-correct forms, such as "\x3g" and "abc\vdef", which were clearly intended to be strings or, at the very least, can't be anything else. Rather than reporting that the string was malformed, the error will state that no string was matched at all, and if the parser wanders off to try other matches before giving up, the diagnostic messages might be less than helpful.
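To make the failure mode concrete, here is a rough Python translation of the strict pattern. It is a sketch under assumptions: Python's `re` has no `[[:xdigit:]]`, so `[0-9A-Fa-f]` is substituted; the group is assumed to repeat (`*`); and ordinary characters are assumed to exclude both `"` and `\`. `re.VERBOSE` plays the role of the "spaces are ignored" rule.

```python
import re

# Strict string syntax: only \xHH and \n \r \t \" \\ escapes are allowed.
STRICT = re.compile(r'''
    "
    (?: \\x[0-9A-Fa-f]{2}   # hex escape, exactly two hex digits
      | \\[nrt"\\]          # simple escapes
      | [^"\\]              # any ordinary character
    )*
    "
''', re.VERBOSE)

for candidate in [r'"abc\n"', r'"\x3f"', r'"\x3g"', r'"abc\vdef"']:
    verdict = "string" if STRICT.fullmatch(candidate) else "no match at all"
    print(candidate, "->", verdict)
```

The last two inputs are rejected outright, so a parser built on this pattern alone has no way to say "this is a string, but its escape is wrong".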

Consider, now, the following syntax:

STRING -> /{ " (?: [^"\\] | \\. )* " } &? /{...as above...}

The LHS of the &? expression is much sloppier, matching pretty much anything between a pair of ", but with correct escape-handling for \". In essence, this term matches anything that "looks like" a string. The parser then sees a malformed string not as a failed match but as a matched STRING with errors. It can generate a full AST with "\x3g" in it while reporting that there's a problem with that particular node.
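A minimal sketch of how such a rule might behave, again using Python's `re` rather than wbnf itself. The `Node` class, the `parse_string` function and the error text are hypothetical, and both patterns assume a repeated group and the `[0-9A-Fa-f]` substitution for `[[:xdigit:]]`.

```python
import re
from dataclasses import dataclass, field

# Sloppy term: anything between a pair of quotes, honouring backslash escapes.
SLOPPY = re.compile(r'"(?:[^"\\]|\\.)*"')
# Strict term: the tight definition of a valid string.
STRICT = re.compile(r'"(?:\\x[0-9A-Fa-f]{2}|\\[nrt"\\]|[^"\\])*"')

@dataclass
class Node:
    rule: str
    text: str
    errors: list = field(default_factory=list)

def parse_string(src, pos=0):
    """STRING -> SLOPPY &? STRICT: match SLOPPY, then check STRICT on the same span."""
    m = SLOPPY.match(src, pos)
    if m is None:
        return None  # does not even look like a string
    node = Node("STRING", m.group())
    if STRICT.fullmatch(node.text) is None:
        node.errors.append("malformed string literal")
    return node

print(parse_string(r'"\x3g" + 1'))
print(parse_string(r'"a\"b"'))
```

The malformed literal still yields a STRING node covering exactly the quoted text, so the rest of the input can be parsed normally and the error stays attached to the node that caused it.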

Another handy usage is name -> IDENT ~? keyword. In this case, rather than name -> IDENT ~ keyword confusing the author with "name not found", the new form could produce: keyword "var" not allowed as name.
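The keyword case might look roughly like this. The identifier pattern, the keyword set and the `parse_name` helper are all assumptions made for illustration; only the error wording is taken from the text above.

```python
import re

IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')
KEYWORDS = {"var", "if", "else"}  # hypothetical keyword set

def parse_name(src, pos=0):
    """name -> IDENT ~? keyword: match IDENT, flag an error if it is a keyword."""
    m = IDENT.match(src, pos)
    if m is None:
        return None  # no identifier here at all
    text = m.group()
    error = f'keyword "{text}" not allowed as name' if text in KEYWORDS else None
    return text, error

print(parse_name("var = 1"))
print(parse_name("variable = 1"))
```

Note that the match itself still succeeds for "var"; only the attached error distinguishes it from a legal name.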

When a parser encounters such errors, it could choose to fail immediately, but adding errors to the AST makes it possible to perform downstream processing that is still useful in spite of the error. For example, a syntax highlighter doesn't want to give up coloring in just because there's one malformed string. Moreover, it wants to also color in malformed strings, but in some error-signalling way.

Make sure chains work as intended: a &? b &? c and a ~? b ~? c and even a &? b ~? c.
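One possible reading of chaining, sketched below, is that every constraint is checked against the span matched by the leftmost term and all violations are collected. That interpretation is an assumption, not something this issue specifies; the `constrained` helper and its error messages are hypothetical.

```python
import re

def constrained(base, *checks):
    """Build a parser for: base OP1 p1 OP2 p2 ..., with each OP being '&?' or '~?'."""
    def parse(text):
        m = re.match(base, text)
        if m is None:
            return None
        matched = m.group()
        errors = []
        # Assumption: each check applies to the span matched by `base`, left to right.
        for op, pattern in checks:
            hit = re.fullmatch(pattern, matched) is not None
            if op == "&?" and not hit:
                errors.append(f"expected to match {pattern}")
            elif op == "~?" and hit:
                errors.append(f"must not match {pattern}")
        return matched, errors
    return parse

# a &? b ~? c: lowercase word, at least three characters, not a keyword.
word = constrained(r"[a-z]+", ("&?", r".{3,}"), ("~?", r"var|let"))
print(word("hello"))
print(word("ab"))
print(word("var x"))
```

Under this reading the order of constraints only affects the order of the reported errors, never the span that is matched.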

Caveat: This is a hinting system, which I'm opposed to in principle. However, auto-detecting intent is a hard problem, so I'm putting it into the too-hard basket for now and allowing the hinting approach in the meantime.

arr-ai / wbnf

Define constrained variants of & and ~ for better error handling #10