carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
https://github.com/carbon-language/carbon-lang/blob/trunk/README.md
Other
32.3k stars 1.48k forks source link

Grammar railroad diagram #1423

Open mingodad opened 1 year ago

mingodad commented 1 year ago

Obs: Updated with the grammar on 2022-10-09

Using this fork of bison (https://github.com/mingodad/lalr-parser-test) to generate an EBNF understood by https://www.bottlecaps.de/rr/ui and manually adding the tokens from lexer.lpp we can have a navigable railroad diagram (https://en.wikipedia.org/wiki/Syntax_diagram) to help develop/debug/document the language.

Copy and paste the EBNF shown bellow on https://www.bottlecaps.de/rr/ui in the Edit Grammar tab then click the View Diagram tab to get a navigable railroad diagram.

/* converted on Thu Oct 6, 2022, 14:13 (UTC+02) by bison-to-w3c v0.59 which is Copyright (c) 2011-2022 by Gunther Rademacher <grd@gmx.net> */

input    ::= package_directive import_directive* declaration*
package_directive
         ::= PACKAGE identifier optional_library_path api_or_impl SEMICOLON
import_directive
         ::= IMPORT identifier optional_library_path SEMICOLON
optional_library_path
         ::= ( LIBRARY string_literal )?
api_or_impl
         ::= API
           | IMPL
primary_expression
         ::= identifier
           | designator
           | PERIOD? SELF
           | integer_literal
           | string_literal
           | TRUE
           | FALSE
           | sized_type_literal
           | STRING
           | BOOL
           | TYPE
           | CONTINUATION_TYPE
           | paren_expression
           | struct_literal
           | struct_type_literal
           | LEFT_SQUARE_BRACKET expression SEMICOLON expression RIGHT_SQUARE_BRACKET
postfix_expression
         ::= ( primary_expression | intrinsic_identifier tuple ) ( designator | PERIOD LEFT_PARENTHESIS expression RIGHT_PARENTHESIS | LEFT_SQUARE_BRACKET expression RIGHT_SQUARE_BRACKET | tuple | POSTFIX_STAR | UNARY_STAR )*
ref_deref_expression
         ::= ( PREFIX_STAR | UNARY_STAR | AMPERSAND )* postfix_expression
fn_type_expression
         ::= FN_TYPE tuple ARROW type_expression
type_expression
         ::= ref_deref_expression
           | bitwise_and_expression
           | fn_type_expression
minus_expression
         ::= MINUS ref_deref_expression
complement_expression
         ::= CARET ref_deref_expression
unary_expression
         ::= minus_expression
           | complement_expression
simple_binary_operand
         ::= ref_deref_expression
           | unary_expression
multiplicative_lhs
         ::= simple_binary_operand
           | multiplicative_expression
multiplicative_expression
         ::= multiplicative_lhs ( BINARY_STAR | SLASH ) simple_binary_operand
additive_operand
         ::= simple_binary_operand
           | multiplicative_expression
additive_lhs
         ::= simple_binary_operand
           | additive_expression
additive_expression
         ::= multiplicative_expression
           | additive_lhs ( PLUS | MINUS ) additive_operand
modulo_expression
         ::= simple_binary_operand PERCENT simple_binary_operand
bitwise_and_lhs
         ::= simple_binary_operand
           | bitwise_and_expression
bitwise_and_expression
         ::= bitwise_and_lhs AMPERSAND simple_binary_operand
bitwise_or_lhs
         ::= simple_binary_operand
           | bitwise_or_expression
bitwise_or_expression
         ::= bitwise_or_lhs PIPE simple_binary_operand
bitwise_xor_lhs
         ::= simple_binary_operand
           | bitwise_xor_expression
bitwise_xor_expression
         ::= bitwise_xor_lhs CARET simple_binary_operand
bitwise_expression
         ::= bitwise_and_expression
           | bitwise_or_expression
           | bitwise_xor_expression
bit_shift_expression
         ::= simple_binary_operand ( LESS_LESS | GREATER_GREATER ) simple_binary_operand
as_expression
         ::= simple_binary_operand AS simple_binary_operand
unimpl_expression
         ::= ref_deref_expression UNIMPL_EXAMPLE ref_deref_expression
value_expression
         ::= additive_expression
           | as_expression
           | bitwise_expression
           | bit_shift_expression
           | fn_type_expression
           | modulo_expression
           | unary_expression
           | unimpl_expression
comparison_operand
         ::= ref_deref_expression
           | value_expression
comparison_operator
         ::= EQUAL_EQUAL
           | LESS
           | LESS_EQUAL
           | GREATER
           | GREATER_EQUAL
           | NOT_EQUAL
comparison_expression
         ::= value_expression
           | comparison_operand comparison_operator comparison_operand
not_expression
         ::= NOT ref_deref_expression
predicate_expression
         ::= not_expression
           | comparison_expression
and_or_operand
         ::= ref_deref_expression
           | predicate_expression
and_lhs  ::= and_or_operand
           | and_expression
and_expression
         ::= and_lhs AND and_or_operand
or_lhs   ::= and_or_operand
           | or_expression
or_expression
         ::= or_lhs OR and_or_operand
where_clause
         ::= comparison_operand ( IS | EQUAL_EQUAL ) comparison_operand
where_expression
         ::= type_expression WHERE where_clause ( AND where_clause )*
type_or_where_expression
         ::= type_expression
           | where_expression
statement_expression
         ::= ref_deref_expression
           | predicate_expression
           | and_expression
           | or_expression
           | where_expression
if_expression
         ::= statement_expression
           | IF expression THEN if_expression ELSE if_expression
expression
         ::= if_expression
designator
         ::= PERIOD identifier
paren_expression
         ::= paren_expression_base
tuple    ::= paren_expression_base
paren_expression_base
         ::= LEFT_PARENTHESIS ( paren_expression_contents COMMA? )? RIGHT_PARENTHESIS
paren_expression_contents
         ::= expression ( COMMA expression )*
struct_literal
         ::= LEFT_CURLY_BRACE struct_literal_contents COMMA? RIGHT_CURLY_BRACE
struct_literal_contents
         ::= designator EQUAL expression ( COMMA designator EQUAL expression )*
struct_type_literal
         ::= LEFT_CURLY_BRACE ( struct_type_literal_contents COMMA? )? RIGHT_CURLY_BRACE
struct_type_literal_contents
         ::= designator COLON expression ( COMMA designator COLON expression )*
pattern  ::= non_expression_pattern
           | expression
non_expression_pattern
         ::= VAR* ( AUTO | binding_lhs ( COLON pattern | COLON_BANG expression ) | paren_pattern | postfix_expression tuple_pattern )
binding_lhs
         ::= identifier
           | UNDERSCORE
paren_pattern
         ::= paren_pattern_base
paren_pattern_base
         ::= LEFT_PARENTHESIS paren_pattern_contents COMMA? RIGHT_PARENTHESIS
paren_pattern_contents
         ::= ( paren_expression_contents COMMA )? non_expression_pattern ( COMMA ( expression | non_expression_pattern ) )*
tuple_pattern
         ::= paren_pattern_base
maybe_empty_tuple_pattern
         ::= LEFT_PARENTHESIS RIGHT_PARENTHESIS
           | tuple_pattern
clause   ::= ( CASE pattern | DEFAULT ) DOUBLE_ARROW block
statement
         ::= ( ( statement_expression | VAR pattern | RETURNED VAR variable_declaration ) ( EQUAL expression )? | ( LET pattern EQUAL | RUN ) expression | BREAK | CONTINUE | AWAIT | RETURN ( return_expression | VAR ) ) SEMICOLON
           | if_statement
           | ( ( WHILE LEFT_PARENTHESIS expression | FOR LEFT_PARENTHESIS variable_declaration IN type_expression ) RIGHT_PARENTHESIS | CONTINUATION identifier ) block
           | MATCH LEFT_PARENTHESIS expression RIGHT_PARENTHESIS LEFT_CURLY_BRACE clause* RIGHT_CURLY_BRACE
if_statement
         ::= IF LEFT_PARENTHESIS expression RIGHT_PARENTHESIS block optional_else
optional_else
         ::= ( ELSE ( if_statement | block ) )?
return_expression
         ::= expression?
block    ::= LEFT_CURLY_BRACE statement* RIGHT_CURLY_BRACE
return_term
         ::= ( ARROW ( AUTO | expression ) )?
generic_binding
         ::= identifier COLON_BANG expression
deduced_param
         ::= generic_binding
           | ADDR? variable_declaration
deduced_param_list
         ::= deduced_param? ( COMMA deduced_param )*
deduced_params
         ::= ( LEFT_SQUARE_BRACKET deduced_param_list RIGHT_SQUARE_BRACKET )?
impl_deduced_params
         ::= ( FORALL LEFT_SQUARE_BRACKET deduced_param_list RIGHT_SQUARE_BRACKET )?
function_declaration
         ::= FN identifier deduced_params maybe_empty_tuple_pattern return_term ( block | SEMICOLON )
variable_declaration
         ::= identifier COLON pattern
alias_declaration
         ::= ALIAS identifier EQUAL expression SEMICOLON
mix_declaration
         ::= MIX expression SEMICOLON
alternative
         ::= identifier tuple?
alternative_list
         ::= ( alternative_list_contents COMMA? )?
alternative_list_contents
         ::= alternative ( COMMA alternative )*
type_params
         ::= tuple_pattern?
mixin_import
         ::= ( FOR expression )?
class_declaration_extensibility
         ::= ( ABSTRACT | BASE )?
class_declaration_extends
         ::= ( EXTENDS expression )?
declaration
         ::= function_declaration
           | destructor_declaration
           | ( class_declaration_extensibility CLASS identifier type_params class_declaration_extends LEFT_CURLY_BRACE ( declaration | mix_declaration )* | MIXIN identifier type_params mixin_import LEFT_CURLY_BRACE ( function_declaration | mix_declaration )* | CHOICE identifier type_params LEFT_CURLY_BRACE alternative_list | INTERFACE identifier type_params LEFT_CURLY_BRACE ( function_declaration | LET generic_binding SEMICOLON )* | impl_kind IMPL impl_deduced_params impl_type AS type_or_where_expression LEFT_CURLY_BRACE ( function_declaration | alias_declaration )* ) RIGHT_CURLY_BRACE
           | ( VAR variable_declaration ( EQUAL expression )? | LET variable_declaration EQUAL expression ) SEMICOLON
           | alias_declaration
impl_kind
         ::= EXTERNAL?
impl_type
         ::= type_expression?
destructor_declaration
         ::= DESTRUCTOR deduced_params block

// Tokens

//\(\S+\)\s+\(\S+\)
/* table-begin */
ABSTRACT ::= "abstract"
ADDR ::= "addr"
ALIAS ::= "alias"
AMPERSAND ::= "&"
AND ::= "and"
API ::= "api"
ARROW ::= "->"
AS ::= "as"
AUTO ::= "auto"
AWAIT ::= "__await"
BASE ::= "base"
BOOL ::= "bool"
BREAK ::= "break"
CARET ::= "^"
CASE ::= "case"
CHOICE ::= "choice"
CLASS ::= "class"
COLON ::= ":"
COLON_BANG ::= ":!"
COMMA ::= ","
CONTINUATION ::= "__continuation"
CONTINUATION_TYPE ::= "__Continuation"
CONTINUE ::= "continue"
DEFAULT ::= "default"
DESTRUCTOR ::= "destructor"
DOUBLE_ARROW ::= "=>"
ELSE ::= "else"
EQUAL ::= "="
EQUAL_EQUAL ::= "=="
EXTENDS ::= "extends"
EXTERNAL ::= "external"
FALSE ::= "false"
FN ::= "fn"
FN_TYPE ::= "__Fn"
FOR ::= "for"
FORALL ::= "forall"
GREATER ::= ">"
GREATER_EQUAL ::= ">="
GREATER_GREATER ::= ">>"
IF ::= "if"
IMPL ::= "impl"
IMPORT ::= "import"
IN ::= "in"
INTERFACE ::= "interface"
IS ::= "is"
LEFT_CURLY_BRACE ::= "{"
LEFT_PARENTHESIS ::= "("
LEFT_SQUARE_BRACKET ::= "["
LESS ::= "<"
LESS_EQUAL ::= "<="
LESS_LESS ::= "<<"
LET ::= "let"
LIBRARY ::= "library"
MATCH ::= "match"
MINUS ::= "-"
MIX ::= "__mix"
MIXIN ::= "__mixin"
NOT ::= "not"
NOT_EQUAL ::= "!="
OR ::= "or"
PACKAGE ::= "package"
PERCENT ::= "%"
PERIOD ::= "."
PIPE ::= "|"
PLUS ::= "+"
RETURN ::= "return"
RETURNED ::= "returned"
RIGHT_CURLY_BRACE ::= "}"
RIGHT_PARENTHESIS ::= ")"
RIGHT_SQUARE_BRACKET ::= "]"
RUN ::= "__run"
SELF ::= "Self"
SEMICOLON ::= ";"
SLASH ::= "/"
STRING ::= "String"
THEN ::= "then"
TRUE ::= "true"
TYPE ::= "Type"
UNDERSCORE ::= "_"
UNIMPL_EXAMPLE ::= "__unimplemented_example_infix"
VAR ::= "var"
WHERE ::= "where"
WHILE ::= "while"
/* table-end */
mingodad commented 1 year ago

Trying to create an LL(1) parser with https://mingodad.github.io/CocoR-Typescript (and family) I found that this definition is problematic:

sized_type_literal    [iuf][1-9][0-9]*

It's ambiguous with this:

identifier            [A-Za-z_][A-Za-z0-9_]*

So when we see something like shown bellow, both rules apply and without context sensitive parsing/lexing how to know if they are identifiers or sized_type_literals ?

i3 || u4 || f5

I would prefer to have it postfixed like:

sized_type_literal   [1-9][0-9]*  [iuf]
mingodad commented 1 year ago

Also there isn't any definition for floating point numbers .

geoffromer commented 1 year ago

The Bison grammar probably isn't going to be very useful as a source-of-truth, because some of our lexing and parsing rules are pretty awkward to express in Bison, especially around the partial precedence order and the way we disambiguate *. For the same reason, I'm not sure how useful railroad diagrams will be as documentation (but I don't know much about them).

As for the issue with the grammar not being LL(1) because of sized_type_literal, I think you should open a separate issue for that, because it's an issue with the language itself, unrelated to railroad diagrams.

JamesJCode commented 1 year ago

@geoffromer Is the aim for the grammar to be LL(1)? I can see reference to "finite bounded lookahead" and "context free", i.e. wanting some flavour of LL(k) / LR(k) (and associated families) parsing, but no explicit requirement of LL(1). It would be useful to spell that out expliticly if it is a design goal / restriction.

JamesJCode commented 1 year ago

@geoffromer Is the aim for the grammar to be LL(1)? I can see reference to "finite bounded lookahead" and "context free", i.e. wanting some flavour of LL(k) / LR(k) (and associated families) parsing, but no explicit requirement of LL(1). It would be useful to spell that out expliticly if it is a design goal / restriction.

Moving reply from Discord for the record, from @chandlerc:

"You can't really parse infix binary operators with an LL grammar, it needs to be LR. We'd like to be LALR(1), and would be very reluctant to give up on LALR(k)."

mingodad commented 1 year ago

It was an experiment because I was thinking that Carbon would try to cut as much ambiguity as possible and maybe even be able to be parsed by LL(1) parser but it seems that's not the case. Anyway thank you for all you help/feedback !

chandlerc commented 1 year ago

We can still generate RR diagrams for the LR(1) grammar? Is this issue still useful?

mingodad commented 1 year ago

I think so, to have a clean global view of the grammar without the code.

JamesJCode commented 1 year ago

FWIW I think they are useful, and as the aim is for the language to have a formal specification, constructing it could end up just being part of the CI process.

There's also an argument based on inclusiveness which chimes with Carbon's goals - some people find reasoning about visual representations much easier than more abstract textual ones.

mingodad commented 1 year ago

And don't forget that the railroad diagrams generated by https://www.bottlecaps.de/rr/ui are navigable and cross referenced.

mingodad commented 1 year ago

I have updated the EBNF to be viewed at https://www.bottlecaps.de/rr/ui with the grammar on 2022-10-09.

tdjastrzebski commented 1 year ago

I came across this thread wondering where/how Carbon grammar was defined. Would not it be a good idea to use ANTLR lexer/parser generator? As far as I can tell, it is the best on this planet.

geoffromer commented 1 year ago

The Explorer is currently using Flex and Bison (see lexer.lpp and parser.ypp), I think mostly because they're widely available. I don't think anybody's looked much into the costs and benefits of switching to ANTLR.

The production compiler is using a hand-rolled lexer and parser. I believe that's in order to maximize efficiency and diagnostic quality, but I wasn't closely involved in those discussions, so I may not have that right.

tdjastrzebski commented 1 year ago

@geoffromer Thanks, I guess, I was just curious. I believe there were important reasons why classic Flex/Bison was chosen over way more modern and highly maintained ANTLR, but I do not see in what way Flex/Bison is more available. I agree, at this early stage of the project changing lexer/parser may be pointless and seem relatively expensive. I would just keep ANTLR on the list of alternatives in case one day Flex/Bison turns out to be too limited and.. it is still not too late. Here is a good discussion of this subject: Why you should not use (f)lex, yacc and bison

github-actions[bot] commented 1 year ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time. This issue is labeled inactive because the last activity was over 90 days ago.

mingodad commented 10 months ago

I'm also working to achieve a LALR(1)/LEX to try grammars online with wasm based on https://github.com/BenHanson/gram_grep and I've got the carbon-lang grammar working, view it here https://mingodad.github.io/parsertl-playground/playground/ select Carbon parser (need review of *) from the examples, you can edit the Grammar or the Input source and press Parse to see a parser tree.

I hope it can be a nice tool to experiment with LALR(1)/LEX grammars with instant feedback !

github-actions[bot] commented 7 months ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time.

This issue is labeled inactive because the last activity was over 90 days ago.