arr-ai / wbnf

ωBNF implementation
Apache License 2.0

Unify term and regexp syntax #19

Open marcelocantos opened 4 years ago

marcelocantos commented 4 years ago

There are two languages baked into wbnf, the grammar syntax and the regexp syntax, which are already very similar. We should see if we can bring them together as a single unified language.

  1. The following operations have the same syntax and meaning: a|b, ab, a?, a*, a+, a{m,n}, [chars], [^chars], \pN, \PN, re??, re*?, re+?, re{m,n}? and most \-letter combinations.

  2. The following operations have different syntax but the same meaning:

| regexp | wbnf |
| --- | --- |
| (?P<name>a) | name=a |
| (?:a) | (a) |

  3. Regexps have the following operators, which have no counterpart in the grammar:

| type | regexp | proposed wbnf | notes |
| --- | --- | --- | --- |
| Numbered capture | (re) | | Won't support. Wbnf has (term) syntax, but it is non-capturing. |
| Reluctant quantifiers | re??, re*?, ... | same | Implemented for regexps. Should also be implemented for terms. |
| Flags | (?flags) (?flags:re) | ?flags | Disallowed after a term. |
| Lookaside assertions | (?=re) (?!re) (?<=re) (?<!re) | (?= term+ ) (?! term+ ) | Not supported by RE2, but the lookahead forms might be useful in wbnf as a stopgap till LL(k) or LL(*) is implemented. |
| Anchors | ^ $ | same | |

  4. Regexps currently act as a natural embodiment of a token, which has important implications for the structure of an output AST and for both the computational efficiency and the cognitive load of working with it. This warrants some kind of syntax to demarcate tokens. The current regexp syntax, /{}, will probably suffice for this. Anything inside /{...} will be clumped together as a single token, with any internal structure discarded. If the internal structure is needed, it can be extracted by reparsing the text against the internal terms.
    • Currently, /{...} uses the first capturing group as the text of the output token. How will this be done when (...) no longer denotes a capturing group? Maybe /{...@=(token)...}? This could perhaps be extended to support tokens as tuples if multiple names appear inside /{...}.
    • This would also support a useful optimisation. If everything inside /{...} can be expressed as regular expressions, the entire form may be compiled as a single regexp matcher.
    • Another concern is that some use cases (grammar analysers, optimisers, grammar transforms, etc.) might need access to the internal structure of a parsed /{...} node. This can be achieved simply by reparsing the token. If it's in the form /{ rule }, this is as simple as running the parser for rule across the text of the output node. For more complex forms, see #18.
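
The token behaviour described in item 4 can be sketched roughly as follows. This uses Python's re module purely as a stand-in for wbnf's matcher. The dotted-name token and the names TOKEN, PART, match_token and reparse are hypothetical, made up for illustration. The whole /{...} body compiles to one regexp, the match comes back as a single flat string, and the internal structure is recovered only on demand by reparsing that string.

```python
import re

# Hypothetical token for a dotted name, roughly:
#   /{ [A-Za-z_] \w* ( '.' [A-Za-z_] \w* )* }
# Everything inside is regular, so it can be compiled as a single regexp.
TOKEN = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")

# Hypothetical inner rule, used only when a token's structure is needed.
PART = re.compile(r"[A-Za-z_]\w*")


def match_token(text, pos=0):
    """Return the token at pos as one flat string, or None if no match."""
    m = TOKEN.match(text, pos)
    return m.group(0) if m else None


def reparse(token):
    """Recover the internal structure by reparsing the token text."""
    return PART.findall(token)


token = match_token("pkg.mod.name = 1")
print(token)           # pkg.mod.name
print(reparse(token))  # ['pkg', 'mod', 'name']
```

Ordinary consumers of the AST only ever see the flat string. Tools that need the structure, such as grammar analysers, pay for the second parse only when they ask for it.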

Here's an initial stab at elements of the new grammar supporting the above:

COMMENT -> scomment=/{ '//' .* } '\n'
         | mcomment=/{?s '/*' ( [^*] | '*'+ [^*/] )* '*'+ '/' };
IDENT   -> /{ '@' | [A-Za-z_\.] \w* };
STR     -> "'" squote=/{ ( `\.` | [^\\'] )* } "'"
         | '"' dquote=/{ ( `\.` | [^\\"] )* } '"'
         | '`' bquote=/{ ( '``' | [^`]   )* } '`';
INT     -> /{ [\d]+ };
RE      -> '[' neg='^'? chars=/{ ( `\]` | [^\]] )+ } ']';
TOKEN   -> '/{' term* '}';
MOD     -> /{ '?' [ims] };
REF     -> /{ '%' IDENT };
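
As a rough illustration of the single-regexp optimisation mentioned earlier, a few of the token bodies above can be translated by hand into Python's re syntax. This is only a sketch: wbnf itself targets Go's RE2-based regexps, the translations are my own reading of the rules, and RULES and token_text are hypothetical names. Quoted literals become escaped text, ( ... ) becomes (?: ... ), and name=... becomes (?P<name>...).

```python
import re

# Assumed hand translations of some rules from the sketch grammar.
RULES = {
    "IDENT": re.compile(r"@|[A-Za-z_.]\w*"),
    "INT": re.compile(r"\d+"),
    "SQUOTE": re.compile(r"'(?P<squote>(?:\\.|[^\\'])*)'"),
    "SCOMMENT": re.compile(r"(?P<scomment>//.*)\n"),
}


def token_text(rule, text):
    """Match text against a rule and return the token text, or None.

    If the rule names part of the match, that part is the token text;
    otherwise the whole match is used.
    """
    m = RULES[rule].fullmatch(text)
    if m is None:
        return None
    named = m.groupdict()
    return next(iter(named.values())) if named else m.group(0)


print(token_text("IDENT", "foo_1"))        # foo_1
print(token_text("INT", "42"))             # 42
print(token_text("SQUOTE", r"'it\'s'"))    # it\'s
print(token_text("SCOMMENT", "// hi\n"))   # // hi
```

The named groups here play the role that the first capturing group plays in the current /{...} syntax: they select which part of the match becomes the text of the output token.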