cenotelie / hime

Apache License 2.0
27 stars 4 forks source link

How to print multiple ASTs for GLR parser correctly? #88

Open stevefan1999-personal opened 11 months ago

stevefan1999-personal commented 11 months ago

I'm trying to implement a C parser using hime based on the grammar here (https://www.quut.com/c/ANSI-C-grammar-y-2011.html) and here (https://www.quut.com/c/ANSI-C-grammar-l-2011.html), and for this particular input:

    typedef char T;
    int main() {
        T * *hello;
    }

There should be multiple parsing trees (a primary expression T multiply pointer dereference of "hello" or a double pointer declaration of type T named "hello"), but I realized that the default function to print AST:

fn print(node: AstNode, crossings: &[bool]) {
    let mut i = 0;
    if !crossings.is_empty() {
        while i < crossings.len() - 1 {
            print!("{:}", if crossings[i] { "|   " } else { "    " });
            i += 1;
        }
        print!("+-> ");
    }
    println!("{node}");
    i = 0;
    let children = node.children();
    while i < children.len() {
        let mut child_crossings = crossings.to_owned();
        child_crossings.push(i < children.len() - 1);
        print(children.at(i), &child_crossings);
        i += 1;
    }
}

Just prints the latter which is type definition. This is correct but only part of the truth. Maybe this function is not sufficient?

C grammar:

grammar C
{
    options
    {
        Axiom = "translation_unit";
        Separator = "SEPARATOR";
    }
    terminals
    {
        // A.1.1 Line terminators
        NEW_LINE        -> U+000D /* CR */
                        |  U+000A /* LF */
                        |  U+000D U+000A /* CR LF */
                        |  U+0085 // Next line character
                        |  U+2028 // Line separator character
                        |  U+2029 ; //Paragraph separator character (U+2029)

        // A.1.2 White space
        WHITE_SPACE        -> uc{Zs} | U+0009 | U+000B | U+000C ;

        // A.1.3 Comments
        COMMENT_LINE    -> '//' (.* - (.* NEW_LINE .*)) ;
        COMMENT_BLOCK    -> '/*' (.* - (.* '*/' .*)) '*/' ;

        // A.1.6 Identifiers
        // fragment IDENTIFIER_CHAR        -> uc{Lu} | uc{Ll} | uc{Lt} | uc{Lm} | uc{Lo} | uc{Nl} ;
        // IDENTIFIER            -> (IDENTIFIER_CHAR | '_') (IDENTIFIER_CHAR | '_' | uc{Nd} | uc{Pc} | uc{Cf})* ;
        IDENTIFIER            ->  [a-zA-Z_] [a-zA-Z0-9_]* ;

        // A.1.8 Literals
        INTEGER_LITERAL_DECIMAL        -> ('0' | [1-9] [0-9]*) ([Uu] [Ll]? | [Ll] [Uu]? )? ;
        INTEGER_LITERAL_HEXA        -> '0' [xX] [a-fA-F0-9]+ ([Uu] [Ll]? | [Ll] [Uu]? )? ;

        fragment EXPONENT -> [eE] ('+'|'-')? ('0' | [1-9] [0-9]*) ;
        fragment REAL_LITERAL_SUFFIX -> [FfDdMm] ;
        REAL_LITERAL                -> ('0' | [1-9] [0-9]*)? '.' ('0' | [1-9] [0-9]*) EXPONENT? REAL_LITERAL_SUFFIX?
                                    |  ('0' | [1-9] [0-9]*) EXPONENT  REAL_LITERAL_SUFFIX?
                                    |  ('0' | [1-9] [0-9]*) REAL_LITERAL_SUFFIX;
        fragment HEX_ESCAPE_LITERAL -> '\\' 'x' [a-fA-F0-9]{1,4}
                                            | '\\' [uU] [a-fA-F0-9]{4} ([a-fA-F0-9]{4})? ;
        CHARACTER_LITERAL            -> '\'' ( (. - ('\'' | '\\' | NEW_LINE))
                                            | '\\' ('\'' | '"' | '\\' | [0abfnrtv])
                                            | HEX_ESCAPE_LITERAL
                                        ) '\'' ;
        STRING_LITERAL        -> '"'  ( (. - ('"' | '\\' | NEW_LINE))
                                            | '\\' ('\'' | '"' | '\'' | '\\' | [0abfnrtv])
                                            | HEX_ESCAPE_LITERAL
                                        )* '"' ;

        SEPARATOR        -> (NEW_LINE | WHITE_SPACE | COMMENT_LINE | COMMENT_BLOCK)+;
    }
    rules
    {
        primary_expression
            -> IDENTIFIER
            | constant
            | string
            | '(' expression ')'
            | generic_selection
            ;

        constant
            -> INTEGER_LITERAL_DECIMAL
            | INTEGER_LITERAL_HEXA
            | REAL_LITERAL
            | IDENTIFIER    /* after it has been defined as such */
            ;

        enumeration_constant        /* before it has been defined as such */
            -> IDENTIFIER
            ;

        string
            -> STRING_LITERAL
            | '__func__'
            ;

        generic_selection
            -> '_Generic' '(' assignment_expression ',' generic_assoc_list ')'
            ;

        generic_assoc_list
            -> generic_association
            | generic_assoc_list ',' generic_association
            ;

        generic_association
            -> type_name ':' assignment_expression
            | 'default' ':' assignment_expression
            ;

        postfix_expression
            -> primary_expression
            | postfix_expression '[' expression ']'
            | postfix_expression '(' ')'
            | postfix_expression '(' argument_expression_list ')'
            | postfix_expression '.' IDENTIFIER
            | postfix_expression '->' IDENTIFIER
            | postfix_expression '++'
            | postfix_expression '--'
            | '(' type_name ')' '{' initializer_list '}'
            | '(' type_name ')' '{' initializer_list ',' '}'
            ;

        argument_expression_list
            -> assignment_expression
            | argument_expression_list ',' assignment_expression
            ;

        unary_expression
            -> postfix_expression
            | '++' unary_expression
            | '--' unary_expression
            | unary_operator cast_expression
            | 'sizeof' unary_expression
            | 'sizeof' '(' type_name ')'
            | '_Alignof' '(' type_name ')'
            ;

        unary_operator
            -> '&'
            | '*'
            | '+'
            | '-'
            | '~'
            | '!'
            ;

        cast_expression
            -> unary_expression
            | '(' type_name ')' cast_expression
            ;

        multiplicative_expression
            -> cast_expression
            | multiplicative_expression '*' cast_expression
            | multiplicative_expression '/' cast_expression
            | multiplicative_expression '%' cast_expression
            ;

        additive_expression
            -> multiplicative_expression
            | additive_expression '+' multiplicative_expression
            | additive_expression '-' multiplicative_expression
            ;

        shift_expression
            -> additive_expression
            | shift_expression '<<' additive_expression
            | shift_expression '>>' additive_expression
            ;

        relational_expression
            -> shift_expression
            | relational_expression '<' shift_expression
            | relational_expression '>' shift_expression
            | relational_expression '<=' shift_expression
            | relational_expression '>=' shift_expression
            ;

        equality_expression
            -> relational_expression
            | equality_expression '==' relational_expression
            | equality_expression '!=' relational_expression
            ;

        and_expression
            -> equality_expression
            | and_expression '&' equality_expression
            ;

        exclusive_or_expression
            -> and_expression
            | exclusive_or_expression '^' and_expression
            ;

        inclusive_or_expression
            -> exclusive_or_expression
            | inclusive_or_expression '|' exclusive_or_expression
            ;

        logical_and_expression
            -> inclusive_or_expression
            | logical_and_expression '&&' inclusive_or_expression
            ;

        logical_or_expression
            -> logical_and_expression
            | logical_or_expression '||' logical_and_expression
            ;

        conditional_expression
            -> logical_or_expression
            | logical_or_expression '?' expression ':' conditional_expression
            ;

        assignment_expression
            -> conditional_expression
            | unary_expression assignment_operator assignment_expression
            ;

        assignment_operator
            -> '='
            | '*='
            | '/='
            | '%='
            | '+='
            | '-='
            | '<<='
            | '>>='
            | '&='
            | '^='
            | '|='
            ;

        expression
            -> assignment_expression
            | expression ',' assignment_expression
            ;

        constant_expression
            -> conditional_expression    /* with constraints */
            ;

        declaration
            -> declaration_specifiers ';'
            | declaration_specifiers init_declarator_list ';'
            | static_assert_declaration
            ;

        declaration_specifiers
            -> storage_class_specifier declaration_specifiers
            | storage_class_specifier
            | type_specifier declaration_specifiers
            | type_specifier
            | type_qualifier declaration_specifiers
            | type_qualifier
            | function_specifier declaration_specifiers
            | function_specifier
            | alignment_specifier declaration_specifiers
            | alignment_specifier
            ;

        init_declarator_list
            -> init_declarator
            | init_declarator_list ',' init_declarator
            ;

        init_declarator
            -> declarator '=' initializer
            | declarator
            ;

        storage_class_specifier
            -> 'typedef'    /* identifiers must be flagged as TYPEDEF_NAME */
            | 'extern'
            | 'static'
            | '_Thread_local'
            | 'auto'
            | 'register'
            ;

        type_specifier
            -> 'void'
            | 'char'
            | 'short'
            | 'int'
            | 'long'
            | 'float'
            | 'double'
            | 'signed'
            | 'unsigned'
            | '_Bool'
            | '_Complex'
            | '_Imaginary'          /* non-mandated extension */
            | atomic_type_specifier
            | struct_or_union_specifier
            | enum_specifier
            | IDENTIFIER        /* after it has been defined as such */
            ;

        struct_or_union_specifier
            -> struct_or_union '{' struct_declaration_list '}'
            | struct_or_union IDENTIFIER '{' struct_declaration_list '}'
            | struct_or_union IDENTIFIER
            ;

        struct_or_union
            -> 'struct'
            | 'union'
            ;

        struct_declaration_list
            -> struct_declaration
            | struct_declaration_list struct_declaration
            ;

        struct_declaration
            -> specifier_qualifier_list ';'    /* for anonymous struct/union */
            | specifier_qualifier_list struct_declarator_list ';'
            | static_assert_declaration
            ;

        specifier_qualifier_list
            -> type_specifier specifier_qualifier_list
            | type_specifier
            | type_qualifier specifier_qualifier_list
            | type_qualifier
            ;

        struct_declarator_list
            -> struct_declarator
            | struct_declarator_list ',' struct_declarator
            ;

        struct_declarator
            -> ':' constant_expression
            | declarator ':' constant_expression
            | declarator
            ;

        enum_specifier
            -> 'enum' '{' enumerator_list '}'
            | 'enum' '{' enumerator_list ',' '}'
            | 'enum' IDENTIFIER '{' enumerator_list '}'
            | 'enum' IDENTIFIER '{' enumerator_list ',' '}'
            | 'enum' IDENTIFIER
            ;

        enumerator_list
            -> enumerator
            | enumerator_list ',' enumerator
            ;

        enumerator    /* identifiers must be flagged as ENUMERATION_CONSTANT */
            -> enumeration_constant '=' constant_expression
            | enumeration_constant
            ;

        atomic_type_specifier
            -> '_Atomic' '(' type_name ')'
            ;

        type_qualifier
            -> 'const'
            | 'restrict'
            | 'volatile'
            | '_Atomic'
            ;

        function_specifier
            -> 'inline'
            | '_Noreturn'
            ;

        alignment_specifier
            -> '_Alignas' '(' type_name ')'
            | '_Alignas' '(' constant_expression ')'
            ;

        declarator
            -> pointer direct_declarator
            | direct_declarator
            ;

        direct_declarator
            -> IDENTIFIER
            | '(' declarator ')'
            | direct_declarator '[' ']'
            | direct_declarator '[' '*' ']'
            | direct_declarator '[' 'static' type_qualifier_list assignment_expression ']'
            | direct_declarator '[' 'static' assignment_expression ']'
            | direct_declarator '[' type_qualifier_list '*' ']'
            | direct_declarator '[' type_qualifier_list 'static' assignment_expression ']'
            | direct_declarator '[' type_qualifier_list assignment_expression ']'
            | direct_declarator '[' type_qualifier_list ']'
            | direct_declarator '[' assignment_expression ']'
            | direct_declarator '(' parameter_type_list ')'
            | direct_declarator '(' ')'
            | direct_declarator '(' identifier_list ')'
            ;

        pointer
            -> '*' type_qualifier_list pointer
            | '*' type_qualifier_list
            | '*' pointer
            | '*'
            ;

        type_qualifier_list
            -> type_qualifier
            | type_qualifier_list type_qualifier
            ;

        parameter_type_list
            -> parameter_list ',' '...'
            | parameter_list
            ;

        parameter_list
            -> parameter_declaration
            | parameter_list ',' parameter_declaration
            ;

        parameter_declaration
            -> declaration_specifiers declarator
            | declaration_specifiers abstract_declarator
            | declaration_specifiers
            ;

        identifier_list
            -> IDENTIFIER
            | identifier_list ',' IDENTIFIER
            ;

        type_name
            -> specifier_qualifier_list abstract_declarator
            | specifier_qualifier_list
            ;

        abstract_declarator
            -> pointer direct_abstract_declarator
            | pointer
            | direct_abstract_declarator
            ;

        direct_abstract_declarator
            -> '(' abstract_declarator ')'
            | '[' ']'
            | '[' '*' ']'
            | '[' 'static' type_qualifier_list assignment_expression ']'
            | '[' 'static' assignment_expression ']'
            | '[' type_qualifier_list 'static' assignment_expression ']'
            | '[' type_qualifier_list assignment_expression ']'
            | '[' type_qualifier_list ']'
            | '[' assignment_expression ']'
            | direct_abstract_declarator '[' ']'
            | direct_abstract_declarator '[' '*' ']'
            | direct_abstract_declarator '[' 'static' type_qualifier_list assignment_expression ']'
            | direct_abstract_declarator '[' 'static' assignment_expression ']'
            | direct_abstract_declarator '[' type_qualifier_list assignment_expression ']'
            | direct_abstract_declarator '[' type_qualifier_list 'static' assignment_expression ']'
            | direct_abstract_declarator '[' type_qualifier_list ']'
            | direct_abstract_declarator '[' assignment_expression ']'
            | '(' ')'
            | '(' parameter_type_list ')'
            | direct_abstract_declarator '(' ')'
            | direct_abstract_declarator '(' parameter_type_list ')'
            ;

        initializer
            -> '{' initializer_list '}'
            | '{' initializer_list ',' '}'
            | assignment_expression
            ;

        initializer_list
            -> designation initializer
            | initializer
            | initializer_list ',' designation initializer
            | initializer_list ',' initializer
            ;

        designation
            -> designator_list '='
            ;

        designator_list
            -> designator
            | designator_list designator
            ;

        designator
            -> '[' constant_expression ']'
            | '.' IDENTIFIER
            ;

        static_assert_declaration
            -> '_Static_assert' '(' constant_expression ',' STRING_LITERAL ')' ';'
            ;

        statement
            -> labeled_statement
            | compound_statement
            | expression_statement
            | selection_statement
            | iteration_statement
            | jump_statement
            ;

        labeled_statement
            -> IDENTIFIER ':' statement
            | 'case' constant_expression ':' statement
            | 'default' ':' statement
            ;

        compound_statement
            -> '{' '}'
            | '{'  block_item_list '}'
            ;

        block_item_list
            -> block_item
            | block_item_list block_item
            ;

        block_item
            -> declaration
            | statement
            ;

        expression_statement
            -> ';'
            | expression ';'
            ;

        selection_statement
            -> 'if' '(' expression ')' statement 'else' statement
            | 'if' '(' expression ')' statement
            | 'switch' '(' expression ')' statement
            ;

        iteration_statement
            -> 'while' '(' expression ')' statement
            | 'do' statement 'while' '(' expression ')' ';'
            | 'for' '(' expression_statement expression_statement ')' statement
            | 'for' '(' expression_statement expression_statement expression ')' statement
            | 'for' '(' declaration expression_statement ')' statement
            | 'for' '(' declaration expression_statement expression ')' statement
            ;

        jump_statement
            -> 'goto' IDENTIFIER ';'
            | 'continue' ';'
            | 'break' ';'
            | 'return' ';'
            | 'return' expression ';'
            ;

        translation_unit
            -> external_declaration
            | translation_unit external_declaration
            ;

        external_declaration
            -> function_definition
            | declaration
            ;

        function_definition
            -> declaration_specifiers declarator declaration_list compound_statement
            | declaration_specifiers declarator compound_statement
            ;

        declaration_list
            -> declaration
            | declaration_list declaration
            ;

    }
}
woutersl commented 11 months ago

It is true that in the case of GLR parsing Hime maintains a collection of concurrent parse trees (Hime use Shared Packed Parse Forest, SPPF) while there are ambiguities, but only one version is returned in the end, arbitrarily the first found. From a user perspective, the result returned by Hime only represent this single AST at this time.

If returning the full SPPF is desirable, we could hammer something out, the data is there, it's a question of exposing it.

stevefan1999-personal commented 11 months ago

It is true that in the case of GLR parsing Hime maintains a collection of concurrent parse trees (Hime use Shared Packed Parse Forest, SPPF) while there are ambiguities, but only one version is returned in the end, arbitrarily the first found. From a user perspective, the result returned by Hime only represent this single AST at this time.

If returning the full SPPF is desirable, we could hammer something out, the data is there, it's a question of exposing it.

Yes, it is desirable, consider this example:

    int main() {
        int T, a, *z, y;
        T * a;
        int x = y / *z;
    }

As you could see T * a is supposed to be a multiplication expression (because both T and a are declared int), but the AST in Hime is "misinterpreting" it as a pointer declaration here.

I have an idea that we could iterate the "AST"s, maintain another stack to store the type information, and delete unfavorable AST and finally come up with a valid representation.

    int main() {
        int T, a, *z, y; // KNOWN: T, a, y is int, z is int*
        T * a; // Two ASTs here: (multiply T a) and (ptr-decl T a), but we know the latter is impossible due to the symbol T being int.
        int x = y / *z;
    }

So until Hime can have runtime feedback to the parsing state, we need the forest so that we can further prune out invalid trees after parsing using GLR, and do so during semantic verification phase. Do note this phase can also be done in parallel.

This also seems to be done by Elkhound as well

stevefan1999-personal commented 11 months ago

I tried the new SPPF production rules, it seems like it does not work well with repeat statement?

compound_statement -> '{'! (statement | declaration)* '}'!;

Gives only one version (declaration derivation), but this:

        compound_statement
            -> '{' '}'
            | '{'  block_item_list '}'
            ;

        block_item_list
            -> block_item
            | block_item_list block_item
            ;

        block_item
            -> declaration
            | statement
            ;

Correctly gives two versions.

Here's my optimized version of the C grammar if you want:

grammar C
{
    options
    {
        Axiom = "translation_unit";
        Separator = "SEPARATOR";
    }
    terminals
    {
        // A.1.1 Line terminators
        NEW_LINE        -> U+000D /* CR */
                        |  U+000A /* LF */
                        |  U+000D U+000A /* CR LF */
                        |  U+0085 // Next line character
                        |  U+2028 // Line separator character
                        |  U+2029 ; //Paragraph separator character (U+2029)

        // A.1.2 White space
        WHITE_SPACE     -> uc{Zs} | U+0009 | U+000B | U+000C ;

        // A.1.3 Comments
        COMMENT_LINE    -> '//' (.* - (.* NEW_LINE .*)) ;
        COMMENT_BLOCK   -> '/*' (.* - (.* '*/' .*)) '*/' ;

        // A.1.6 Identifiers
        // fragment IDENTIFIER_CHAR     -> uc{Lu} | uc{Ll} | uc{Lt} | uc{Lm} | uc{Lo} | uc{Nl} ;
        // IDENTIFIER           -> (IDENTIFIER_CHAR | '_') (IDENTIFIER_CHAR | '_' | uc{Nd} | uc{Pc} | uc{Cf})* ;
        IDENTIFIER          ->  [a-zA-Z_] [a-zA-Z0-9_]* ;

        // A.1.8 Literals
        INTEGER_LITERAL_DECIMAL     -> ('0' | [1-9] [0-9]*) ([Uu] [Ll]? | [Ll] [Uu]? )? ;
        INTEGER_LITERAL_HEXA        -> '0' [xX] [a-fA-F0-9]+ ([Uu] [Ll]? | [Ll] [Uu]? )? ;

        fragment EXPONENT -> [eE] ('+'|'-')? ('0' | [1-9] [0-9]*) ;
        fragment REAL_LITERAL_SUFFIX -> [FfDdMm] ;
        REAL_LITERAL                -> ('0' | [1-9] [0-9]*)? '.' ('0' | [1-9] [0-9]*) EXPONENT? REAL_LITERAL_SUFFIX?
                                    |  ('0' | [1-9] [0-9]*) EXPONENT  REAL_LITERAL_SUFFIX?
                                    |  ('0' | [1-9] [0-9]*) REAL_LITERAL_SUFFIX;
        fragment HEX_ESCAPE_LITERAL -> '\\' 'x' [a-fA-F0-9]{1,4}
                                            | '\\' [uU] [a-fA-F0-9]{4} ([a-fA-F0-9]{4})? ;
        CHARACTER_LITERAL           -> '\'' ( (. - ('\'' | '\\' | NEW_LINE))
                                            | '\\' ('\'' | '"' | '\\' | [0abfnrtv])
                                            | HEX_ESCAPE_LITERAL
                                        ) '\'' ;
        STRING_LITERAL      -> '"'  ( (. - ('"' | '\\' | NEW_LINE))
                                            | '\\' ('\'' | '"' | '\'' | '\\' | [0abfnrtv])
                                            | HEX_ESCAPE_LITERAL
                                        )* '"' ;

        SEPARATOR       -> (NEW_LINE | WHITE_SPACE | COMMENT_LINE | COMMENT_BLOCK)+;
    }
    rules
    {

        comma_repeated<rule> -> rule^ (','! rule)*;
        or<a, b> -> a b | a | b;

        primary_expression -> 
              IDENTIFIER
            | constant
            | string
            | '('! expression^ ')'!
            | generic_selection
        ;

        constant -> 
              INTEGER_LITERAL_DECIMAL
            | INTEGER_LITERAL_HEXA
            | REAL_LITERAL
            | IDENTIFIER    /* after it has been defined as such */
        ;

        enumeration_constant -> IDENTIFIER /* before it has been defined as such */ ;

        string
            -> STRING_LITERAL
            | '__func__'
            ;

        generic_selection -> '_Generic'! '('! assignment_expression ','! comma_repeated<generic_association> ')'!;

        generic_association -> (type_name | 'default') ':'! assignment_expression;

        postfix_expression -> 
              primary_expression^
            | postfix_expression '['! expression ']'!
            | postfix_expression '('! comma_repeated<assignment_expression>? ')'!
            | postfix_expression ('.'|'->')^ IDENTIFIER
            | postfix_expression ('++'|'--')
            | '('! type_name ')'! initializer_block
        ;

        unary_expression -> 
              postfix_expression^
            | ('++'|'--')^ unary_expression
            | ('&'| '*'| '+'| '-'| '~'| '!')^ cast_expression
            | 'sizeof' unary_expression
            | 'sizeof' '('! type_name ')'!
            | '_Alignof' '('! type_name ')'!
        ;

        cast_expression -> unary_expression^ | '('! type_name ')'! cast_expression;
        multiplicative_expression -> cast_expression^ (('*' | '/' | '%')^ cast_expression)*;
        additive_expression -> multiplicative_expression^ (('+' | '-')^ multiplicative_expression)*;
        shift_expression -> additive_expression^ (('<<' | '>>')^ additive_expression)*;
        relational_expression -> shift_expression^ (('<' | '>' | '<=' | '>=')^ shift_expression)*;
        equality_expression -> relational_expression^ (('==' | '!=')^ relational_expression)*;
        and_expression -> equality_expression^ ('&'^ equality_expression)*;
        exclusive_or_expression -> and_expression^ ('^'^ and_expression)*;
        inclusive_or_expression -> exclusive_or_expression^ ('|'^ exclusive_or_expression)*;
        logical_and_expression -> inclusive_or_expression^ ('&&'^ inclusive_or_expression)*;
        logical_or_expression -> logical_and_expression^ ('||'^ logical_and_expression)*;
        conditional_expression -> logical_or_expression^ ('?' expression ':' conditional_expression)?;

        assignment_expression -> 
              conditional_expression^
            | unary_expression assignment_operator^ assignment_expression
        ;

        assignment_operator -> (
              '='
            | '*='
            | '/='
            | '%='
            | '+='
            | '-='
            | '<<='
            | '>>='
            | '&='
            | '^='
            | '|='
        )^;

        expression -> comma_repeated<assignment_expression>;

        constant_expression -> conditional_expression   /* with constraints */;

        init_declarator -> declarator ('='! initializer)?;
        declaration -> 
              declaration_specifiers comma_repeated<init_declarator>? ';'!
            | static_assert_declaration
        ;

        declaration_specifiers -> (storage_class_specifier | type_specifier | type_qualifier | function_specifier | alignment_specifier)+;

        storage_class_specifier -> 
              'typedef' /* identifiers must be flagged as TYPEDEF_NAME */
            | 'extern'
            | 'static'
            | '_Thread_local'
            | 'auto'
            | 'register'
        ;

        type_specifier -> 
              'void'
            | 'char'
            | 'short'
            | 'int'
            | 'long'
            | 'float'
            | 'double'
            | 'signed'
            | 'unsigned'
            | '_Bool'
            | '_Complex'
            | '_Imaginary'      /* non-mandated extension */
            | atomic_type_specifier
            | struct_or_union_specifier
            | enum_specifier
            | IDENTIFIER        /* after it has been defined as such */
        ;

        struct_or_union_specifier -> 
              struct_or_union '{'! struct_declaration+ '}'!
            | struct_or_union IDENTIFIER ('{'! struct_declaration+ '}'!)?
        ;

        struct_or_union -> 'struct' | 'union';

        struct_declaration -> 
              specifier_qualifier_list comma_repeated<struct_declarator>? /* for anonymous struct/union */ ';'!
            | static_assert_declaration
        ;

        specifier_qualifier_list -> (type_specifier | type_qualifier)+;

        bitfield_declarator -> ':'! constant_expression;
        struct_declarator -> 
              bitfield_declarator
            | declarator bitfield_declarator
            | declarator
        ;

        enum_block -> '{'! comma_repeated<enumerator> ','!? '}'!;
        enum_specifier -> 
              'enum' enum_block
            | 'enum' IDENTIFIER enum_block?
        ;

        enumerator  /* identifiers must be flagged as ENUMERATION_CONSTANT */
            -> enumeration_constant ('='! constant_expression)?
            ;

        atomic_type_specifier -> '_Atomic' '('! type_name ')'!;

        type_qualifier -> 'const' | 'restrict' | 'volatile' | '_Atomic' ;

        function_specifier -> 'inline' | '_Noreturn';

        alignment_specifier -> '_Alignas' '('! (type_name | constant_expression) ')'! ;

        declarator -> pointer? direct_declarator;

        direct_declarator_accessor ->
              '['! type_qualifier* '*'? ']'!
            | '['! 'static' type_qualifier* assignment_expression ']'!
            | '['! type_qualifier+ 'static'? assignment_expression ']'!
            | '['! (type_qualifier+ | assignment_expression)? ']'!
            | '('! (parameter_type_list | comma_repeated<IDENTIFIER>)? ')'!
        ;
        direct_declarator
            -> IDENTIFIER
            | '('! declarator ')'!
            | direct_declarator direct_declarator_accessor
            ;

        pointer -> ('*'! type_qualifier*)+;

        parameter_type_list -> comma_repeated<parameter_declaration> (','! '...')?;

        parameter_declaration -> declaration_specifiers (declarator | abstract_declarator)?;

        type_name -> specifier_qualifier_list abstract_declarator?;

        abstract_declarator -> or<pointer, direct_abstract_declarator>;

        direct_abstract_declarator_accessor ->
              '['! type_qualifier* '*'? ']'!
            | '['! 'static' type_qualifier* assignment_expression ']'!
            | '['! type_qualifier+ 'static'? assignment_expression ']'!
            | '['! assignment_expression ']'!
        ;

        direct_abstract_declarator -> 
              '('! abstract_declarator ')'!
            | (direct_abstract_declarator_accessor | '('! parameter_type_list? ')'!)*
        ;

        initializer_block -> '{'! comma_repeated<initializer_expression>? ','!? '}'!;
        initializer_expression -> (designator+ '='!)? initializer;
        initializer -> initializer_block | assignment_expression;

        designator -> '['! constant_expression ']'! | '.'! IDENTIFIER;

        static_assert_declaration -> '_Static_assert'! '('! constant_expression ','! STRING_LITERAL ')'! ';'! ;

        statement -> 
              labeled_statement
            | compound_statement
            | expression_statement
            | selection_statement
            | iteration_statement
            | jump_statement
        ;

        labeled_statement -> (IDENTIFIER | 'case' constant_expression | 'default') ':'! statement;
        compound_statement -> '{'! (declaration | statement)* '}'!;
        expression_statement -> expression? ';'!;

        selection_statement -> 
              'if' '('! expression ')'! statement ('else' statement)?
            | 'switch' '('! expression ')'! statement
        ;

        iteration_statement -> 
              'while' '('! expression ')'! statement
            | 'do' statement 'while' '('! expression ')'! ';'!
            | 'for' '('! (expression_statement | declaration) expression_statement expression? ')'! statement
        ;

        jump_statement -> 
              'goto' IDENTIFIER ';'!
            | 'continue' ';'!
            | 'break' ';'!
            | 'return' ';'!
            | 'return' expression ';'!
        ;

        translation_unit -> external_declaration+;
        external_declaration -> function_definition | declaration;
        function_definition -> declaration_specifiers declarator declaration* compound_statement;

    }
}
woutersl commented 11 months ago

Yes, there still are some issues to iron out. Thing is, the previous code was geared toward the construction of a single AST so the way tree manipulations (promotion primarily with ^) happen have some finicky details to get right.

In your example, with * you have an implicitly generated variable which is auto-replaced in the tree by its children. I'll try to get a simpler reproducible example for this.

woutersl commented 11 months ago

By the way, here is the code which is able to print the different versions of the SPPF:

use hime_redist::{ast::AstNode, sppf::SppfNode};

fn print_crossings(crossings: &[bool]) {
    let mut i = 0;
    if !crossings.is_empty() {
        while i < crossings.len() - 1 {
            print!("{:}", if crossings[i] { "|   " } else { "    " });
            i += 1;
        }
        print!("+-> ");
    }
}

fn print_sppf(node: SppfNode, crossings: &[bool]) {
    if node.versions_count() == 1 {
        print_single_version(node, crossings);
    } else {
        print_multi_versions(node, crossings);
    }
}

fn print_multi_versions(node: SppfNode, crossings: &[bool]) {
    print_crossings(crossings);
    println!("<versions>");

    for (index, version) in node.versions().into_iter().enumerate() {
        let mut crossings = crossings.to_owned();
        crossings.push(index < node.versions_count() - 1);
        print_crossings(&crossings);
        println!("{version}");
        let mut i = 0;
        let children = version.children();
        while i < children.len() {
            let mut child_crossings = crossings.to_owned();
            child_crossings.push(i < children.len() - 1);
            print_sppf(children.at(i), &child_crossings);
            i += 1;
        }
    }
}

fn print_single_version(node: SppfNode, crossings: &[bool]) {
    print_crossings(crossings);
    let version = node.first_version();
    println!("{version}");
    let mut i = 0;
    let children = version.children();
    while i < children.len() {
        let mut child_crossings = crossings.to_owned();
        child_crossings.push(i < children.len() - 1);
        print_sppf(children.at(i), &child_crossings);
        i += 1;
    }
}
stevefan1999-personal commented 11 months ago

By the way, here is the code which is able to print the different versions of the SPPF:

use hime_redist::{ast::AstNode, sppf::SppfNode};

fn print_crossings(crossings: &[bool]) {
    let mut i = 0;
    if !crossings.is_empty() {
        while i < crossings.len() - 1 {
            print!("{:}", if crossings[i] { "|   " } else { "    " });
            i += 1;
        }
        print!("+-> ");
    }
}

fn print_sppf(node: SppfNode, crossings: &[bool]) {
    if node.versions_count() == 1 {
        print_single_version(node, crossings);
    } else {
        print_multi_versions(node, crossings);
    }
}

fn print_multi_versions(node: SppfNode, crossings: &[bool]) {
    print_crossings(crossings);
    println!("<versions>");

    for (index, version) in node.versions().into_iter().enumerate() {
        let mut crossings = crossings.to_owned();
        crossings.push(index < node.versions_count() - 1);
        print_crossings(&crossings);
        println!("{version}");
        let mut i = 0;
        let children = version.children();
        while i < children.len() {
            let mut child_crossings = crossings.to_owned();
            child_crossings.push(i < children.len() - 1);
            print_sppf(children.at(i), &child_crossings);
            i += 1;
        }
    }
}

fn print_single_version(node: SppfNode, crossings: &[bool]) {
    print_crossings(crossings);
    let version = node.first_version();
    println!("{version}");
    let mut i = 0;
    let children = version.children();
    while i < children.len() {
        let mut child_crossings = crossings.to_owned();
        child_crossings.push(i < children.len() - 1);
        print_sppf(children.at(i), &child_crossings);
        i += 1;
    }
}

I just used a simple modification by tracking the version number

fn print(node: SppfNode, crossings: &[bool]) {
    let mut version = 1;
    for node in node.versions() {
        let mut i = 0;
        if !crossings.is_empty() {
            while i < crossings.len() - 1 {
                print!("{:}", if crossings[i] { "|   " } else { "    " });
                i += 1;
            }
            print!("+-{version}-> ");
        }
        println!("{node}");

        i = 0;
        let children = node.children();
        while i < children.len() {
            let mut child_crossings = crossings.to_owned();
            child_crossings.push(i < children.len() - 1);
            print(children.at(i), &child_crossings);
            i += 1;
        }
        version += 1;
    }
}
stevefan1999-personal commented 11 months ago

@woutersl by the way, what need to be done to implement the "forest walker" (in contrast to the AST walker) as well? Are there any paper describing the technique to walk this kind of forest? Not sure if I could help but I have no clues if I'm not having any materials at the deck in the moment (I mean I'm not a student anymore already for a year haha).

But of course, the easiest treat is to keep using a depth-first approach to generate a bunch of spanning trees and try each tree and see if anyone of them terminates.

This really makes me reminisce the parallel the likes between NFA and DFA over this approach. In fact, I'm not sure if the GLR parser is also inspired by this. You just make the LR parser "forkable" at the current state, just like NFA is also "forkable" to try different transitions that terminates/matches.

I have another obscure idea I can't express clearly (I'm not sure if that's novel or not) that if we want to implement semantic checker to multiple "trees"/"branches" of the SPPF at the same time, at the very least, the data structure state holding the information about the current AST walker must be immutable -- which means data is cloned between each forest transition. Then we will need to use persistent data structure/Copy-On-Write mechanism/transactional memory to manage the state data. While I'm pretty sure functional programming people may have enjoyed writing parsers this way because their way to handle data structures are immutable by design, so this is trivial for all kinds of program, but this idea came back to the time when I was still using rust-peg (https://github.com/kevinmehall/rust-peg/discussions/301)

Oh I also spotted another problem, consider this C++ code:

int f() {
  return g(123);
}

int g(int x) {
  return x + 42;
}

This may look like a C code but it is only valid in C++, in C if you do not have forward declarations, you will assume a "bare function prototype" -- which means the function g is treated as unsigned (*)() rather than int (*)(int) in C-speak. So while both are semantically correct and parsable in both language, in C++ the semantic checker would hold on to it and see if it could resolve other types first. That's why C++'s approach would correctly resolve the function type of g.

So, to solve this problem, should we also admit combinations of the forests? Although I'm pretty sure this naive approach would lead to combinatorial explosion.

woutersl commented 11 months ago

@stevefan1999-personal I just pushed code that should fix the remaining issues with SPPF construction. I checked your example with C code and it indeed produces two versions now. A piece of advice, it would be best to rewrite the relevant part of the grammar as below:

compound_statement -> '{'! declaration_or_statement* '}'!;
declaration_or_statement -> declaration | statement ;

The reasoning is that doing so, the nodes with multiple versions will be declaration_or_statement and not compound_statement which would be more expensive due to combinatorial explosion.