There are two languages baked into wbnf, the grammar syntax and the regexp syntax, which are already very similar. We should see if we can bring them together as a single unified language.
The following operations have the same syntax and meaning: a|b, ab, a?, a*, a+, a{m,n}, [chars], [^chars], \pN, \PN, re??, re*?, re+?, re{m,n}? and most \-letter combinations.
The following operations have different syntax but the same meaning:
| regexp | wbnf |
| --- | --- |
| (?P<name>a) | name=a |
| (?:a) | (a) |
Regexps have the following operators, which have no counterpart in the grammar.
| type | regexp | proposed wbnf | notes |
| --- | --- | --- | --- |
| Numbered capture | (re) | none | Won't support. Wbnf has (term) syntax, but it is non-capturing. |
| Reluctant quantifiers | re??, re*?, ... | same | Implemented for regexps. Should also be implemented for terms. |
| Flags | (?flags), (?flags:re) | ?flags | Disallowed after a term. |
| Lookaside assertions | (?=re), (?!re), (?<=re), (?<!re) | (?= term+ ), (?! term+ ) | Not supported by RE2, but the lookahead forms might be useful in wbnf as a stopgap till LL(k) or LL(*) is implemented. |
| Anchors | ^, $ | same | |
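As a quick check of which of these forms an RE2-style engine accepts, the sketch below feeds them to Go's standard regexp package, which follows RE2 syntax. It assumes that package is representative of the regexp engine behind wbnf; the helper name compiles is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package (RE2 syntax) accepts pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<name>a)`, // named capture
		`(?:a)`,       // non-capturing group
		`a+?`,         // reluctant quantifier
		`(?i)a`,       // flags
		`(?i:a)b`,     // flags scoped to a group
		`^a$`,         // anchors
		`(?=a)`,       // positive lookahead
		`(?!a)`,       // negative lookahead
		`(?<=a)`,      // positive lookbehind
		`(?<!a)`,      // negative lookbehind
	}
	for _, p := range patterns {
		fmt.Printf("%-12s accepted=%v\n", p, compiles(p))
	}
}
```

The four lookaside forms are the only ones rejected, so any wbnf support for (?= term+ ) and (?! term+ ) would have to live in the grammar parser rather than be delegated to the regexp engine.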
Regexps currently act as a natural embodiment of a token, which has important implications for the structure of the output AST and for both the computational efficiency and the cognitive load of working with it. This warrants some kind of syntax to demarcate tokens. The current regexp syntax, /{}, will probably suffice for this. Anything inside /{...} will be clumped together as a single token with any internal structure discarded. If the internal structure is needed, it can be extracted by reparsing the text against the internal terms.
Currently, /{...} will use the first capturing group as the text of the output token. How will this be done when (...) no longer denotes a capturing group? Maybe /{...@=(token)...}? This could perhaps be extended to support tokens as tuples if multiple names appear inside /{...}.
Demarcating tokens with /{...} would also support a useful optimisation. If everything inside /{...} can be expressed as regular expressions, the entire form may be compiled as a single regexp matcher.
Another concern is that some use cases (grammar analysers, optimisers, grammar transforms, etc.) might need access to the internal structure of a parsed /{...} node. This can be achieved simply by reparsing the token. If it's in the form /{ rule }, this is as simple as running the parser for rule across the text of the output node. For more complex forms, see #18.
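The simple /{ rule } case could look roughly like the following Go sketch. Everything here is hypothetical: Token, scan and reparse are illustrative names, and a regexp with named groups stands in for "the parser for rule" purely to keep the example short.

```go
package main

import (
	"fmt"
	"regexp"
)

// Token is a flat output node: text only, no internal structure.
type Token struct{ Text string }

var (
	// tokenRe recognises the token as a whole and keeps no structure.
	tokenRe = regexp.MustCompile(`^\d+\.\d+\.\d+`)
	// innerRe stands in for the parser of the inner rule.
	innerRe = regexp.MustCompile(`^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$`)
)

// scan produces a flat token from the start of input, if one is present.
func scan(input string) (Token, bool) {
	text := tokenRe.FindString(input)
	return Token{Text: text}, text != ""
}

// reparse recovers the internal structure on demand by running the inner
// parser across the token's text.
func reparse(tok Token) map[string]string {
	m := innerRe.FindStringSubmatch(tok.Text)
	if m == nil {
		return nil
	}
	parts := map[string]string{}
	for i, name := range innerRe.SubexpNames() {
		if name != "" {
			parts[name] = m[i]
		}
	}
	return parts
}

func main() {
	tok, ok := scan("1.15.3 released")
	fmt.Println(tok.Text, ok)
	fmt.Println(reparse(tok))
}
```

Most consumers only ever see Token.Text; tools that need the structure pay for the second parse only when they ask for it.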
Here's an initial stab at elements of the new grammar supporting the above: