antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.16k stars 3.7k forks source link

[Postgresql] Grammar is not idiomatic ANTLR #4291

Open masonwheeler opened 2 days ago

masonwheeler commented 2 days ago

If I had to guess, it looks like this Postgres parser grammar was produced by someone taking a grammar designed for a different parser generator and translating it as literally as possible. But it's got a lot of content that is very bad ANTLR.

For example, you see this pattern a lot:

from_clause
    : FROM from_list
    |
    ;

Having a rule end in | ; means it will always be considered a valid match, even if it contains no content. This is painful to work with, because now in the visitor, on a rule that contains this as a sub-rule, you can't simply say if (context.from_clause() != null) to see if you have a real match; the code to check for it is significantly messier.

The idomatic way to do this in ANTLR is to define it as:

from_clause
    : FROM from_list
    ;

and then have any rule that uses it invoke it as from_clause?.

Also, some things are just really weird. For example, the following:

from_list
    : non_ansi_join
    | table_ref (COMMA table_ref)*
    ;

non_ansi_join
    : table_ref (COMMA table_ref)+
    ;

Unless I'm overlooking some crucial detail, the non_ansi_join rule here is entirely superfluous because the alternative branch of from_list is a strict superset of it. Again, this feels like it was translated overly-literally from some other parser generator's grammar.

Would it be possible to clean this grammar up a bit?

kaby76 commented 2 days ago

Unfortunately, what gets implemented often isn't checked very well. The empty-alternative seems to be an easy thing to detect and fix via script. I'll write a script to detect this and add it to the build. I'll also write a script to fix the problem. This stuff should not be done by hand.

The other problem is an ambiguity. It is detected in my recent changes for detecting ambiguity over the grammars in the repo. This script finds the smallest example in the test suite with the ambiguity.

10/25-07:51:29 ~/issues/g4-4280/sql/postgresql/Generated-CSharp
$ min=99999
minf=''
for f in ../examples/*.sql
do
    dotnet trperf $f -c ar 2>/dev/null | grep from_list | grep -v '^0' -q
    if [ $? -eq 0 ]
    then
        c=`wc -l $f | awk '{print $1}'`
        if [ $c -lt $min ]
        then
            min=$c
            minf=$f
        fi
    fi
done
echo $min " " $minf
4   ../examples/non_ansi_join.sql
10/25-07:57:06 ~/issues/g4-4280/sql/postgresql/Generated-CSharp
$ dotnet trparse --ambig ../examples/non_ansi_join.sql | dotnet trtree -a
CSharp 0 ../examples/non_ansi_join.sql success 0.5531324
(root (stmtblock (stmtmulti (stmt (selectstmt (select_no_parens (select_clause (simple_select_intersect (simple_select_pramary (SELECT "SELECT") (opt_all_clause) (into_clause) (opt_target_list (target_list (target_el (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "a") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))))))))))))))) (COMMA ",") (target_el (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "titlr") (opt_uescape))))))))))))))))))))))))))))) (identifier (Identifier "e") (opt_uescape))))) (into_clause) (from_clause (FROM "FROM") (from_list (non_ansi_join (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "titles") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "t") (opt_uescape)))))) (COMMA ",") (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "authors") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "a") (opt_uescape)))))) (COMMA ",") (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "titleauthor") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "ta") (opt_uescape))))))))) (where_clause (WHERE "WHERE") (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "a") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))) (EQUAL "=") (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "ta") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))))))))) (AND "AND") (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "ta") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title_id") (opt_uescape)))))))))))))))))) (EQUAL "=") (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title_id") (opt_uescape)))))))))))))))))))))))) (AND "AND") (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title") (opt_uescape))))))))))))))))) (LIKE "LIKE") (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (aexprconst (sconst (anysconst (StringConstant "'Example%'")) (opt_uescape))))))))))))) (opt_escape)))))))))))))) (group_clause) (having_clause) (window_clause)))) (opt_sort_clause)))))) (EOF ""))
(root (stmtblock (stmtmulti (stmt (selectstmt (select_no_parens (select_clause (simple_select_intersect (simple_select_pramary (SELECT "SELECT") (opt_all_clause) (into_clause) (opt_target_list (target_list (target_el (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "a") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))))))))))))))) (COMMA ",") (target_el (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "titlr") (opt_uescape))))))))))))))))))))))))))))) (identifier (Identifier "e") (opt_uescape))))) (into_clause) (from_clause (FROM "FROM") (from_list (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "titles") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "t") (opt_uescape)))))) (COMMA ",") (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "authors") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "a") (opt_uescape)))))) (COMMA ",") (table_ref (relation_expr (qualified_name (colid (identifier (Identifier "titleauthor") (opt_uescape))))) (opt_alias_clause (table_alias_clause (AS "AS") (table_alias (identifier (Identifier "ta") (opt_uescape)))))))) (where_clause (WHERE "WHERE") (a_expr (a_expr_qual (a_expr_lessless (a_expr_or (a_expr_and (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "a") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))) (EQUAL "=") (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "ta") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "au_id") (opt_uescape)))))))))))))))))))))))) (AND "AND") (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "ta") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title_id") (opt_uescape)))))))))))))))))) (EQUAL "=") (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title_id") (opt_uescape)))))))))))))))))))))))) (AND "AND") (a_expr_between (a_expr_in (a_expr_unary_not (a_expr_isnull (a_expr_is_not (a_expr_compare (a_expr_like (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (columnref (colid (identifier (Identifier "t") (opt_uescape))) (indirection (indirection_el (DOT ".") (attr_name (collabel (identifier (Identifier "title") (opt_uescape))))))))))))))))) (LIKE "LIKE") (a_expr_qual_op (a_expr_unary_qualop (a_expr_add (a_expr_mul (a_expr_caret (a_expr_unary_sign (a_expr_at_time_zone (a_expr_collate (a_expr_typecast (c_expr (aexprconst (sconst (anysconst (StringConstant "'Example%'")) (opt_uescape))))))))))))) (opt_escape)))))))))))))) (group_clause) (having_clause) (window_clause)))) (opt_sort_clause)))))) (EOF ""))

10/25-08:01:20 ~/issues/g4-4280/sql/postgresql/Generated-CSharp
$

But, as you can see from the ambig.txt file mentioned in https://github.com/antlr/grammars-v4/pull/4276, there's a lot of ambiguity in this grammar.

It's not clear exactly where this grammar came from, but https://github.com/postgres/postgres has many .y grammars, including src/backend/parser/gram.y, but it doesn't look like this grammar. But, it may have been from an earlier version. Again, there should be links to the exact versions of the files that were the source for this Antlr grammar.

masonwheeler commented 2 days ago

@kaby76 Thanks! I'd love to see what your script is able to make of this.

kaby76 commented 17 hours ago

Update...

This is an initial version of the Bash/Trash script to detect empty-alternatives in a rule, e.g., opt_with : WITH | ;.

#
# set -x
set -e
set -o pipefail
for dir in `find . -name desc.xml | sed 's#/desc.xml##' | sort -u`
do
    echo $dir
    # Find rules that contain top-level empty alts.
    # Note, not complete because the alt may be not empty, but could derive empty.
    dotnet trparse -l $dir/*.g4 2>/dev/null > save.pt
    cat save.pt | dotnet trxgrep ' //parserRuleSpec[./ruleBlock/ruleAltList/labeledAlt/alternative[count(./*) = 0]]/RULE_REF/text()' > rules.txt
    # Find locations of use that an operator applied to it.
    for r in `cat rules.txt`
    do
        rr=`echo $r | tr -d '\n' | tr -d '\r'`
        cat save.pt | dotnet trxgrep ' //parserRuleSpec/ruleBlock//element[./atom and not(./ebnfSuffix)]/atom/ruleref/RULE_REF[./text() = "'$rr'"]' | dotnet trcaret
    done
done

The script to delete the empty-alternative is a little harder to write because the script must determine which "OR" (i.e., '|'-operator in the EBNF) to actually delete. If the empty-alternative is after the COLON, then the empty-alternative and the following OR should be deleted. But, if the empty-alternative is preceded by an OR, then the empty-alternative and the preceding OR should be deleted. The code to do this is a little tricky, so I only wrote the later test to keep the test simple.

#
# set -x
set -e
set -o pipefail
for dir in `find . -name desc.xml | sed 's#/desc.xml##' | sort -u`
do
    echo $dir
    # Find rules that contain top-level empty alts.
    # Note, not complete because the alt may be not empty, but could derive empty.
    dotnet trparse -l $dir/*.g4 2>/dev/null > save.pt
    cat save.pt | dotnet trxgrep ' //parserRuleSpec[./ruleBlock/ruleAltList/labeledAlt/alternative[count(./*) = 0]]/RULE_REF/text()' > rules.txt
    # Find locations of use that an operator applied to it.
    for r in `cat rules.txt`
    do
        rr=`echo $r | tr -d '\n' | tr -d '\r'`
        echo Working on $rr
        dotnet trparse $dir/*.g4 2>/dev/null | dotnet trquery replace ' //parserRuleSpec/ruleBlock//element[./atom and not(./ebnfSuffix)]/atom/ruleref/RULE_REF[./text() = "'$rr'"]' "'$rr?'" | dotnet trsponge -o $dir -c
        # Remove empty alt in the rule.
        dotnet trparse $dir/*.g4 2>/dev/null | dotnet trquery delete "  //parserRuleSpec[RULE_REF/text() = '$rr']/ruleBlock/ruleAltList/labeledAlt[alternative/count(./*) = 0 and ./preceding-sibling::*[last()]/self::OR]/(. | ./preceding-sibling::OR[last()])" | dotnet trsponge -o $dir -c
    done
done

The fix script does change the grammar, refactoring it apparently so that the second application of the "detect.sh" script above doesn't find any empty-alternatives. I have tested it, and all example files parse, but there are still a few problems. PostgreSQLParser.g4.txt

1) The script doesn't compute whether a symbol can derive empty. That is what should be checked rather than empty-alternative's. Consequently, in refactoring the grammar, I now see more warnings like ... at least one alternative that can match an empty string. It seems that the Antlr Tool may have a bug in computing symbols that derive empty. There shouldn't be a change in the number of these warnings. The altered grammar changes the rule sql_expression : opt_target_list into_clause from_clause where_clause group_clause having_clause window_clause ; to sql_expression : opt_target_list? into_clause? from_clause? where_clause? group_clause? having_clause? window_clause? ;. That's pretty interesting. An expression is empty. In keeping with what we are trying to do, an expression, really, should not derive empty! 1) Detection and fixing are slow because each empty-alternative is performed one at a time. It can be much faster when the queries are combined.

masonwheeler commented 13 hours ago

Wow, that sql_expression definition is a bit of a mess. Is there any way for ANTLR to express the concept of "at least one of these optional rules must be matched"?