Closed iankronquist closed 8 years ago
I've been working on a BNF grammar but it's been shoddy since I haven't had internet access for the past few days, and I've been going off of memory (i.e. making up my own version of BNF grammars :dancer:).
This is really exciting, thank you for doing this :D Just a few tweaks before I merge.
The file should probably live at the root of the repo as arua.grammar
. The syntax
and plugin
directories are for ViM.
Also, I'm not sure the best way to represent this; scoping is whitespace-based, and must start with 0+ tabs and 0+ spaces - in that order, strictly (i.e. if you use tabs, great. If you use spaces, great. If you mix, you must start with tabs and end with spaces - you cannot go back and forth). This is to be part of the standard and enforced in the parser.
This means block continuation is determined by the sum of tabs + spaces of the preceding line.
:+1:
Also, I'm having trouble finding a validator for such a syntax. It doesn't appear to be EBNF or ABNF. Is this a domain-specific format that Python instills?
@Qix- This is python's own special format. I used this because of the whitespace scoping. I'll go rename the grammar and fix the widths.
Alright, all ready for another round of review.
Is NAME
simply \w+
? How is that defined?
Python's grammar doesn't define it. I followed their lead. Should I go ahead and define it?
In this C bnf grammar they don't define every term. The undefined terms are lumped together at the top.
A regex for identifiers for Python and I believe C would be [a-zA-z_]\w*
.
Here is more information about how Python does its lexing.
It's worth pointing out that you can't actually define true BNF grammar for Python because BNF is context free and Python is context sensitive because of whitespace. Haskell is also whitespace sensitive, and they take a slightly different approach.
I forgot to ask, what do the looping constructs look like? Are there while, for, and do...while loops?
Python's grammar doesn't define it. I followed their lead. Should I go ahead and define it?
We should probably define it. Identifiers don't include _
in Arua (not yet at least - that will be an RFC). [A-Fa-f][A-Fa-f0-9]*
is the identifier pattern. I say define it simply because there are some things that aren't necessarily standard (namely identifiers not including underscores), and I want to be as explicit as possible. It'd be nice to mesh the standard documentation and the grammar together in the end.
It's worth pointing out that you can't actually define true BNF grammar . . . because BNF is context free and (Arua) is context sensitive because of whitespace.
I was wondering about this. We might have to just document it very well, as you mentioned.
what do the looping constructs look like? Are there while, for, and do...while loops?
Yes, but I'm not entirely sure (yet) how they'll look. I want to re-visit the concept of loop syntax since I've heard so many good arguments against for
loops, for example. There are a million ways to write a loop, and I want to see if there's a deterministic factor between all of them that I can focus on when defining loop syntax.
This is looking really good - thank you for doing this. I'll start writing a parser for this specialized format and get it to generate a parser that we can run against the examples and turn it into a CI step. Once the examples are ironed out, I can start with the initial bootstrap compiler parser.
Thinking more about the terminal rules (e.g. INDENT
, DEDENT
, NEWLINE
, NAME
, etc) let's define all but INDENT
and DEDENT
at the bottom, and make a comment at the top describing the intent behind INDENT
and DEDENT
. This will help me with making the parser generator for this. :+1:
Have you written regexes for your sea of numeric types? There are a lot of them and I don't want to replicate work. I think we need to include syntax rules for strings as well. Will there be separate character and string types?
I'll break down primitive types for you. Most things boil down to simple primitives.
Numeric types
There are three classes of numeric types: signed i
ntegers, u
nsigned integers, and f
loating point numbers.
Integers are declared with variable widths of >= 1. Floating point numbers have a fixed selection of widths: 16, 32, 64, and 128.
[ui][1-9][0-9]*|f(16|32|64|128)
Numeric constant values have three types of notation: based, simple, and scientific.
Based notation has a prefix similar to traditional hexadecimal notation (0x
), but instead of the 0
the base can be specified. Bases 1 - 36 (inclusive) are supported. Bases above 10, which require letters starting with A
(base > 10) and ending with Z
(base == 36), require uppercase. This is to not conflict with scientific notation.
Based notation does not support decimals.
Base 16 supports either the 16x
or 0x
prefixes.
Simple notation is simply base-10 based notation without the prefix (just simple integers, e.g. 12345
). Simple notation allows for decimals.
Simple notation does not allow for leading zeros in the integer segment in order to remove confusion with archaic octal notation (which isn't used anymore).
Put simply, simple notation is either 12345
or 123.45
, and cannot have any leading zeros (i.e. 012345
and 0123.45
are invalid, but 123.00045
is indeed valid).
Scientific notation allows for either a based or simple notation integer with an eN
suffix, where e
is a literal lower-case e and N
is a base-10 integer. XeN
in Arua is equivalent to X * pow(10, N)
.
For example, 1234e2
becomes 123400
, and 1234e-3
becomes 1.234
.
All together, here is a regex that should conform to this, though I'd think the above cases would be split apart into their own rules.
([1-9][0-9]*(\.[0-9]+)?|(0|[1-9][0-9]*)x[0-9A-Z]+)(e\-?[1-9][0-9]*)?
Array types
Array types do not map to a standard library collection type. They are their own fully qualified type. As such, you can apply behaviors (using the on
keyword) to an i8
that doesn't propagate to [i8]
. Similarly, applying behaviors to an [i8]
does not propagate those behaviors to i8
nor any other array type (e.g. [i32]
).
This is very similar behavior to typedefs.
Boolean type
The boolean type is actually just an i1
. It will be an RFC if there is a typedef (from the looks of it, there will be, given similar reasons as the justification behind strings).
true
and false
are just language-wide constants, and are not actually keywords.
true i1 = 1
false i1 = 0
String types
str
is typedef'd to [i8]
since all Arua strings are UTF-8 encoded.
String literals return a str
.
Typedefs In the similar spirit of arrays, typedefs become their own type. Behaviors do not propagate between the underlying type and the newly typedef'd type, unlike C languages.
Typedefs do not currently have a proposed syntax. This should be its own RFC, though :+1:
Arua has pattern matching, and since str
is typedef'd to [i8]
, one can cast using the as
operator because they match the same pattern.
hello str = "Hello!"
utfChars [i8] = hello as [i8]
You can call the hypothetical method .toLowerCase()
on hello
, but not on utfChars
. Similarly, .reverse()
has two different meanings given which variable from the example you call it on. hello.reverse()
will return the string's actual UTF-8 characters in reverse order. This doesn't necessarily mean the underlying bytes will be exactly reversed. However, utfChars.reverse()
will reverse the bytes in linear fashion, breaking any multi-byte glyphs.
I think that types should just be aliases for NAMEs and we should let the parser figure out what is and isn't a valid type. My reasoning is that since we allow typedefing anything can be a valid type.
Well []
aren't allowed in a NAME
, though. It'd at least have to be NAME | '[' NAME ']'
.
Also, part of this grammar is to specify the kinds of types as a reference. Perhaps a scalar_type
is in order.
scalar_type: [ui][1-9][0-9]*
| f16 | f32 | f64 | f128
type: scalar_type
| NAME
| '[' NAME ']'
The parser is going to have a tricky time understanding optional clauses and character classes:
# optional clause
[ expression ]
# character class
[0-9]
The convention should be optional clauses be padded with spaces (which means L6, for example, will need to be changed) and character classes don't have spaces at all.
Good point. An easier solution is to just not use brackets for optional clauses. It's possible to express the same thing just by adding other forms of the rules.
Other than those nits, this looks great. :+1:
Great -- fixed the nits and pushed it up.
Beautiful. Thank you :+1: I hope to see you around here more! This is great stuff.
Thanks, I'm pretty busy with school and stuff, but I'll keep an eye on all this and maybe pitch in when I have time.
Please review this carefully! This will need to updated following changes to the language. Grammars make is more clear what is and is not part of the official language. In my limited experience they make implementing parsers easier too. The syntax here is based on Extended BNF following conventions in Python's Grammar.