Qix- / arua-meta

Standards, RFCs and discussion of the Arua language

Proposed grammar #1

Closed · iankronquist closed this issue 8 years ago

iankronquist commented 8 years ago

Please review this carefully! This will need to be updated following changes to the language. Grammars make it clearer what is and is not part of the official language. In my limited experience they make implementing parsers easier too. The syntax here is based on Extended BNF, following the conventions in Python's Grammar.

Qix- commented 8 years ago

I've been working on a BNF grammar but it's been shoddy since I haven't had internet access for the past few days, and I've been going off of memory (i.e. making up my own version of BNF grammars :dancer:).

This is really exciting, thank you for doing this :D Just a few tweaks before I merge.

Qix- commented 8 years ago

The file should probably live at the root of the repo as arua.grammar. The syntax and plugin directories are for Vim.

Qix- commented 8 years ago

Also, I'm not sure the best way to represent this: scoping is whitespace-based, and indentation must consist of 0+ tabs followed by 0+ spaces, strictly in that order (i.e. if you use only tabs, great. If you use only spaces, great. If you mix, you must start with tabs and end with spaces; you cannot go back and forth). This is to be part of the standard and enforced in the parser.

This means block continuation is determined by the sum of tabs + spaces of the preceding line.
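
To make the rule concrete, here is a minimal Python sketch of such an indentation check, following the description above. The helper name and error message are illustrative only, not part of any Arua tooling.

import re

def indent_width(line: str) -> int:
    """Indentation width under the proposed rule: zero or more tabs
    followed by zero or more spaces, in that order only."""
    match = re.match(r"^(\t*)( *)", line)
    tabs, spaces = match.group(1), match.group(2)
    rest = line[len(tabs) + len(spaces):]
    if rest.startswith("\t"):
        raise ValueError("indentation must be tabs first, then spaces")
    # Block continuation is determined by the sum of tabs + spaces.
    return len(tabs) + len(spaces)

assert indent_width("\t\t  foo") == 4   # two tabs then two spaces: fine
assert indent_width("    foo") == 4     # spaces only: fine
# indent_width("  \tfoo") raises ValueError: a tab after spaces is not allowed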

sindresorhus commented 8 years ago

:+1:

Qix- commented 8 years ago

Also, I'm having trouble finding a validator for such a syntax. It doesn't appear to be EBNF or ABNF. Is this a domain-specific format that Python instills?

iankronquist commented 8 years ago

@Qix- This is python's own special format. I used this because of the whitespace scoping. I'll go rename the grammar and fix the widths.

iankronquist commented 8 years ago

Alright, all ready for another round of review.

Qix- commented 8 years ago

Is NAME simply \w+? How is that defined?

iankronquist commented 8 years ago

Python's grammar doesn't define it. I followed their lead. Should I go ahead and define it? In this C BNF grammar they don't define every term either; the undefined terms are lumped together at the top. A regex for identifiers in Python, and I believe C, would be [a-zA-Z_]\w*. Here is more information about how Python does its lexing.

iankronquist commented 8 years ago

It's worth pointing out that you can't actually define a true BNF grammar for Python, because BNF is context-free and Python is context-sensitive due to whitespace. Haskell is also whitespace-sensitive, and they take a slightly different approach.

iankronquist commented 8 years ago

I forgot to ask, what do the looping constructs look like? Are there while, for, and do...while loops?

Qix- commented 8 years ago

Python's grammar doesn't define it. I followed their lead. Should I go ahead and define it?

We should probably define it. Identifiers don't include _ in Arua (not yet at least - that will be an RFC). [A-Za-z][A-Za-z0-9]* is the identifier pattern. I say define it simply because there are some things that aren't necessarily standard (namely, identifiers not including underscores), and I want to be as explicit as possible. It'd be nice to mesh the standard documentation and the grammar together in the end.
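
A quick Python check of the identifier rule as described here (a letter followed by letters or digits, no underscores); the exact character classes may still change by RFC:

import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

assert IDENT.fullmatch("fooBar2")
assert not IDENT.fullmatch("foo_bar")   # underscores are not allowed (yet)
assert not IDENT.fullmatch("2foo")      # cannot start with a digit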

It's worth pointing out that you can't actually define true BNF grammar . . . because BNF is context free and (Arua) is context sensitive because of whitespace.

I was wondering about this. We might have to just document it very well, as you mentioned.

what do the looping constructs look like? Are there while, for, and do...while loops?

Yes, but I'm not entirely sure (yet) how they'll look. I want to re-visit the concept of loop syntax since I've heard so many good arguments against for loops, for example. There are a million ways to write a loop, and I want to see if there's a deterministic factor between all of them that I can focus on when defining loop syntax.

Qix- commented 8 years ago

This is looking really good - thank you for doing this. I'll start writing a parser generator for this specialized format, have it generate a parser we can run against the examples, and turn that into a CI step. Once the examples are ironed out, I can start on the initial bootstrap compiler's parser.

Qix- commented 8 years ago

Thinking more about the terminal rules (e.g. INDENT, DEDENT, NEWLINE, NAME, etc.), let's define all but INDENT and DEDENT at the bottom, and add a comment at the top describing the intent behind INDENT and DEDENT. This will help me with making the parser generator for this. :+1:

iankronquist commented 8 years ago

Have you written regexes for your sea of numeric types? There are a lot of them and I don't want to replicate work. I think we need to include syntax rules for strings as well. Will there be separate character and string types?

Qix- commented 8 years ago

I'll break down primitive types for you. Most things boil down to simple primitives.

Numeric types

There are three classes of numeric types: signed integers, unsigned integers, and floating point numbers.

Integers are declared with variable widths of >= 1. Floating point numbers have a fixed selection of widths: 16, 32, 64, and 128.

[ui][1-9][0-9]*|f(16|32|64|128)
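
As a sanity check, here is how that pattern behaves in Python (the regex is quoted verbatim from above):

import re

NUMERIC_TYPE = re.compile(r"[ui][1-9][0-9]*|f(16|32|64|128)")

assert NUMERIC_TYPE.fullmatch("i8")      # signed 8-bit integer
assert NUMERIC_TYPE.fullmatch("u32")     # unsigned 32-bit integer
assert NUMERIC_TYPE.fullmatch("i1")      # width >= 1, so i1 is valid (used for booleans below)
assert NUMERIC_TYPE.fullmatch("f64")     # float widths are fixed
assert not NUMERIC_TYPE.fullmatch("f8")  # f8 is not an allowed float width
assert not NUMERIC_TYPE.fullmatch("u0")  # zero-width integers are not allowed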

Numeric constant values have three types of notation: based, simple, and scientific.

Based notation has a prefix similar to traditional hexadecimal notation (0x), but instead of the 0, the base is specified. Bases 1 - 36 (inclusive) are supported. Bases above 10 use letters as extra digits, from A (needed once the base exceeds 10) up to Z (only in base 36), and those letters must be uppercase. This is to avoid conflicting with scientific notation.

Based notation does not support decimals.

Base 16 supports either the 16x or 0x prefixes.
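
For illustration, a hypothetical Python helper showing how based literals would evaluate; the function is mine, not part of any Arua tooling, and it skips the base-1 (unary) edge case:

def based_value(literal: str) -> int:
    """Evaluate an Arua-style based literal such as '16xFF', '0xFF'
    (shorthand for base 16) or '2x1010'. Bases 2-36 only."""
    base_str, digits = literal.split("x", 1)
    base = 16 if base_str == "0" else int(base_str)
    return int(digits, base)   # Python's int() accepts bases 2 through 36

assert based_value("16xFF") == 255
assert based_value("0xFF") == 255    # 0x is shorthand for 16x
assert based_value("2x1010") == 10
assert based_value("36xZ") == 35     # Z is the highest digit, only valid in base 36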

Simple notation is simply base-10 based notation without the prefix (just simple integers, e.g. 12345). Simple notation allows for decimals.

Simple notation does not allow leading zeros in the integer segment, to avoid confusion with archaic octal notation (which isn't used anymore).

Put simply, simple notation is either 12345 or 123.45, and cannot have any leading zeros (i.e. 012345 and 0123.45 are invalid, but 123.00045 is indeed valid).

Scientific notation allows for either a based or simple notation integer with an eN suffix, where e is a literal lower-case e and N is a base-10 integer. XeN in Arua is equivalent to X * pow(10, N).

For example, 1234e2 becomes 123400, and 1234e-3 becomes 1.234.

All together, here is a regex that should conform to this, though I'd think the above cases would be split apart into their own rules.

([1-9][0-9]*(\.[0-9]+)?|(0|[1-9][0-9]*)x[0-9A-Z]+)(e\-?[1-9][0-9]*)?
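
A quick Python check of that combined regex against the examples above (regex quoted verbatim; note that a bare 0 does not match it as written):

import re

LITERAL = re.compile(r"([1-9][0-9]*(\.[0-9]+)?|(0|[1-9][0-9]*)x[0-9A-Z]+)(e\-?[1-9][0-9]*)?")

for ok in ("12345", "123.45", "123.00045", "1234e2", "1234e-3", "16xFF", "0xFF"):
    assert LITERAL.fullmatch(ok), ok

for bad in ("012345", "0123.45", "0x"):   # leading zeros and empty digit runs are rejected
    assert not LITERAL.fullmatch(bad), bad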

Array types

Array types do not map to a standard library collection type. They are their own fully qualified type. As such, behaviors you apply (using the on keyword) to an i8 do not propagate to [i8]. Similarly, applying behaviors to an [i8] does not propagate those behaviors to i8 nor to any other array type (e.g. [i32]).

This is very similar behavior to typedefs.

Boolean type

The boolean type is actually just an i1. Whether there will be a typedef will be its own RFC (from the looks of it, there will be, for reasons similar to the justification behind strings).

true and false are just language-wide constants, and are not actually keywords.

true i1 = 1
false i1 = 0

String types

str is typedef'd to [i8] since all Arua strings are UTF-8 encoded.

String literals return a str.

Typedefs

In a similar spirit to arrays, typedefs become their own type. Behaviors do not propagate between the underlying type and the newly typedef'd type, unlike typedefs in C-family languages.

Typedefs do not currently have a proposed syntax. This should be its own RFC, though :+1:

Arua has pattern matching, and since str is typedef'd to [i8], one can cast using the as operator because they match the same pattern.

hello str = "Hello!"
utfChars [i8] = hello as [i8]

You can call the hypothetical method .toLowerCase() on hello, but not on utfChars. Similarly, .reverse() has two different meanings depending on which variable from the example you call it on: hello.reverse() will return the string's actual UTF-8 characters in reverse order (which doesn't necessarily mean the underlying bytes are exactly reversed), whereas utfChars.reverse() will reverse the bytes in linear fashion, breaking any multi-byte glyphs.
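
A rough Python analogy of that distinction, with Python's str and bytes standing in for Arua's str and [i8]:

s = "héllo"                      # 'é' is a two-byte glyph in UTF-8
raw = s.encode("utf-8")          # b'h\xc3\xa9llo', the [i8]-like view

print(s[::-1])                   # 'olléh': reversing characters keeps glyphs intact
try:
    print(raw[::-1].decode("utf-8"))
except UnicodeDecodeError:
    # Reversing the bytes put the continuation byte \xa9 before its lead
    # byte \xc3, so the multi-byte glyph is broken.
    print("invalid UTF-8 after byte-wise reversal")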

iankronquist commented 8 years ago

I think that types should just be aliases for NAMEs and we should let the parser figure out what is and isn't a valid type. My reasoning is that since we allow typedefing, anything can be a valid type.

Qix- commented 8 years ago

Well [] aren't allowed in a NAME, though. It'd at least have to be NAME | '[' NAME ']'.

Also, part of this grammar is to specify the kinds of types as a reference. Perhaps a scalar_type is in order.

scalar_type: [ui][1-9][0-9]*
           | f16 | f32 | f64 | f128

type: scalar_type
    | NAME
    | '[' NAME ']'
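
As a rough Python rendering of those two rules (the NAME pattern here is the assumed identifier pattern from the earlier discussion):

import re

SCALAR = r"[ui][1-9][0-9]*|f(?:16|32|64|128)"
NAME = r"[A-Za-z][A-Za-z0-9]*"            # assumed identifier pattern, see above
TYPE = re.compile(rf"(?:{SCALAR})|(?:{NAME})|\[(?:{NAME})\]")

assert TYPE.fullmatch("u16")       # scalar_type
assert TYPE.fullmatch("Foo")       # NAME
assert TYPE.fullmatch("[i8]")      # '[' NAME ']'
assert not TYPE.fullmatch("[i8")   # unbalanced brackets are rejected
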
Qix- commented 8 years ago

The parser is going to have a tricky time understanding optional clauses and character classes:

# optional clause
[ expression ]

# character class
[0-9]

The convention should be that optional clauses are padded with spaces (which means L6, for example, will need to be changed) and character classes have no spaces at all.

iankronquist commented 8 years ago

Good point. An easier solution is to not use brackets for optional clauses at all; the same thing can be expressed by adding alternative forms of the rules.

Qix- commented 8 years ago

Other than those nits, this looks great. :+1:

iankronquist commented 8 years ago

Great -- fixed the nits and pushed it up.

Qix- commented 8 years ago

Beautiful. Thank you :+1: I hope to see you around here more! This is great stuff.

iankronquist commented 8 years ago

Thanks, I'm pretty busy with school and stuff, but I'll keep an eye on all this and maybe pitch in when I have time.