ωBNF is pronounced "omega BNF".
An ωBNF grammar file consists of an unordered list of rules (called productions) or comments.
A Comment can be either C++-style // This is a comment to the end of the line
or C-style /* This is a comment which may span multiple lines */
A rule is defined in terms of terms, or terminals in the form of NAME -> TERM+ ;
:
a -> b;
indicates a rule named a
which is made up of single term (in
this case the rule b
)a -> b (c d)*;
indicates a rule named a
which is made up of several
terms (in this case the rule b
followed by the sequence of terms c
and d
, which may occur zero or more times in succession)a -> "Token";
indicates a rule named a
which is made of a single
terminal (in this case the string Token
)a -> \d+;
indicates a rule named a
which is made of a single
terminal (in this case the regex \d+
)"
or '
or ` (backquote)Regular Expressions in the form /{RE}
, where RE is the expression to
match. The entire match will be consumed. The parser will use the first
capturing group to populate the output node, or the entire match if there is
none.
Regular expressions mostly conform to RE2 syntax, with the following exceptions:
\_
.}
must be escaped with \}
.The following simple RE forms may omit the surrounding /{…}
:
.
, ^
and $
[…]
and [^…]
\d
where d is an RE2 character class.\pN
or \PN
where N is a single-letter Unicode character class\p{…}
\P{…}
All simple forms may include a quantifier: ?
, *
, +
, {m,n?}
, {n}
and,
optionally, an additional ?
to make the quantifier reluctant (finds the
shortest matching input).
A sequence of the above simple forms with no whitespace in between is treated
as a single regexp, e.g.: ^[a-z][a-z0-9]*$
.
Terms can be grouped in various ways to build up rules.
Sequence
a -> left [-+*/] right;
This rule requires a left
followed by one of the math symbols followed by
a right
.
Choice
a -> ("hello" | "goodbye") name;
This rule requires either the string hello
or goodbye
followed by a
name
.
Simple Multiplicity
+
, ?
, and *
may follow any term to indicate how many occurrences of
the term will be matched.
+
indicates 1 or more occurrences.?
indicates 0 or 1 occurrence.*
indicates 0 or more occurrences.a -> b* ("x" | "y")+;
matches any amount of b
followed by at least one of
either "x"
or "y"
Delimited repetition
One of the design goals of the grammar is to minimise the amount of repetition in each term. If we wanted to create a rule to accept a comma separated list of words the simple version would be:
word -> \w*;
csv -> word ("," word)*;
The rule word
appears 3 times in that tiny snippet! This can be eliminated
with the use of the :
operator after a term: csv -> \w*:",";
expresses
the same rule. (More on this operator below)
Min/Max repetition
a -> "x"{3,9}
indicates that a string of at least 3 x
up to 9 will be
accepted.
0
will be the assumed minimum.unlimited
with be the assumed maximumPrecedence Stacks
Languages often require some way to define the order of operations (remember BODMAS from school?).
The simple form of a math parser would include something like:
expr -> braces+;
braces -> ("(" multdiv+ ")") | multdiv;
multdiv -> addsum (("*" | "/") addsum)*;
addsum -> number (("+" | "-") number)*;
number -> \d+;
This again has heaps of repetition both in each rule and between rules (as each refers to the next in the precedence order).
The >
operator can help with this (newlines are generally ignored by the
parser).
expr -> @:[-+]
> @:[*/]
> "(" @ ")"
> \d+;
In each line of the stack, the @ term implicitly refers to the next line down. Terms further along have a higher precedence than earlier terms.
The above grammar will parse the expression 1 + 2 * 3
as the following
nodes:
"+"
┌─┴─┐
1 "*"
┌─┴─┐
2 3
giving the result 7.
Named Terms
Terms in a rule may be named as a convenience item.
expr -> @:op=[-+]
> @:op=[*/]
> "(" @ ")"
> \d+;
This is the same math grammar as above, except two lines have op=
for the
delimiter term name.
Referenced Terms
TODO: Fill in, not sure how to word this
This is the definition of the delimited repeater op=/{<:|:>?} opt_leading=","? named opt_trailing=","?
.
op
describes the associativity of the separated terms. All forms of the term
a op b
will match sequences of the form a b a ... b a
.
:
denotes a non-associative delimiter. The term a:b
will produce trees
conceptually like the following diagram. The parser will emit a single node
with all terms and delimiters in it: b b
┌─┴─┬ ─ ─ ┴─┐
a a a
:>
denotes a left-to-right associative delimiter. This will produce a
chain of binary nodes. The term a:>b
will produce trees looking like this: b
┌─┴─┐
b a
┌─┴─┐
b a
┌─┴─┐
a a
<:
denotes a right-to-left associative delimiter. The term a<:b
will
produce trees looking like this: b
┌─┴─┐
a b
┌─┴─┐
a b
┌─┴─┐
a a
opt_leading
and opt_trailing
are optional markers used to allow the
separator to start and/or end the sequence.
opt_leading
is present, the sequence is allowed to start with the
separater term.opt_trailing
is present, the sequence is allowed to end with the
separater term.Examples:
x -> a:b,
- Allows ab...aba
or ab...ab
(where ...
represents any
amount of ab
)x -> a:,b
- Allows ab...aba
or bab..aba
(where ...
represents any
amount of ab
)Some special commands are defined in the grammar to control the way the parser executes.
.import relative_filename
Allows the wbnf file to merge the grammar of the imported filename into the current grammar (equivalent to #include
in c)
.macro Name(args) { term }
Allows the use of macros to minimise repetition in the grammar (see below)
Macros can be used when a common pattern is required through the grammar which cant easily be converted to a rule.
Macros are conceptually the same as C-style #define
's, except rather than simply substituting text, a full expression can be used.
We will explain how to use macros by implementing the equivalent of the delimited repeater
.
First a macro is defined .macro Delim(term, sep) { term (sep term)* }
, and used %!Delim(a, "<"? ":" ">"? )
This would expand to a (("<"? ":" ">"?) a)*
which is equivalent of a:("<"? ":" ">"?)
Rules prefixed by a .
are special rules governing the parser's overall
behaviour. The following rules are recognised:
.wrapRE -> /{some () regex}
This rule instructs the parser to wrap every regular expression with this one.
The actual regex is inserted into the ()
.
Example:
.wrapRE -> /{\s*()\s*};
ignores all whitespace surrounding every token in the
grammar..wrapRE -> "--" | [0-9] | /{\s*()\s*};
ignores surrounding whitespace, as
above, but excludes any instance of terms "--"
, and [0-9]
(including
/{[0-9]}
) from wrapping.Below are a collection of helpful rules which can be dropped into your grammar.
block -> indent=(\n+ %indent="\n" \s+) stmt:%indent;
accepts an indented stmt
node.
.macro Indented(term) { indent=(\n+ %indent="\n" \s+) term:%indent }
Macro to simplify adding indentation (use like %!Indented(term)
)
The ωBNF syntax described above is itself implemented in ωBNF. The following grammar is auto-generated from the formal grammar used in the ωBNF parsing engine.
// Non-terminals
grammar -> stmt+;
stmt -> COMMENT | prod | pragma;
prod -> IDENT "->" term+ ";";
term -> (@ ("{" grammar "}")? ):op=">"
> @:op="|"
> @+
> named quant*;
named -> (IDENT op="=")? atom;
quant -> op=[?*+]
| "{" min=INT? "," max=INT? "}"
| op=/{<:|:>?} opt_leading=","? named opt_trailing=","?;
atom -> IDENT
| STR
| RE
| macrocall
| ExtRef=("%%" IDENT)
| REF
| "(?=" lookahead=term ")"
| "(" term ")"
| "(" ")";
macrocall -> "%!" name=IDENT "(" term:","? ")";
REF -> "%" IDENT ("=" default=STR)?;
// Terminals
COMMENT -> /{ //.*$
| (?s: /\* (?: [^*] | \*+[^*/] ) \*/ )
};
IDENT -> /{@|\.?[A-Za-z_]\w*};
INT -> \d+;
STR -> /{ " (?: \\. | [^\\"] )* "
| ' (?: \\. | [^\\'] )* '
| ` (?: `` | [^`] )* `
};
RE -> /{
/{
(?:
\\.
| { (?: (?: \d+(?:,\d*)? | ,\d+ ) \} )?
| \[ (?: \\. | \[:^?[a-z]+:\] | [^\]] )+ ]
| [^\\{\}]
)*
\}
| (?:
(?:
\[ (?: \\. | \[:^?[a-z]+:\] | [^\]] )+ ]
| \\[pP](?:[a-z]|\{[a-zA-Z_]+\})
| \\[a-zA-Z]
| [.^$]
)(?: (?:[+*?]|\{\d+,?\d?\}) \?? )?
)+
};
// Special
pragma -> import | macrodef {
import -> ".import" path=((".."|"."|[a-zA-Z0-9.:]+):,"/") ";"?;
macrodef -> ".macro" name=IDENT "(" args=IDENT:","? ")" "{" term "}" ";"?;
};
.wrapRE -> /{\s*()\s*};