antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[postgresql] collabel is ambiguous; postgresql contains some plsql rules. #4308

Open kaby76 opened 3 days ago

kaby76 commented 3 days ago

Consider the input string SELECT 'trailing' AS first; in comments.sql. The parse is ambiguous because first has three different possible parse trees:

This is caused by the rule at https://github.com/antlr/grammars-v4/blob/199a5121ece05d2f2e7eca330d0738220499e80c/sql/postgresql/PostgreSQLParser.g4#L4233-L4240

The alternatives of this rule overlap considerably.

Over in the original gram.y, the rule is https://github.com/postgres/postgres/blob/027124a872d7b5dfddc69590af42f626b1727dba/src/backend/parser/gram.y#L17560-L17565

kaby76 commented 2 days ago

Symbol classes should be disjoint. I took the gram.y grammar and extracted the symbol classes for each of the relevant productions. They are attached here: yacc_bare_label_keyword.txt yacc_col_name_keyword.txt yacc_reserved_keyword.txt yacc_type_func_name_keyword.txt yacc_unreserved_keyword.txt

I checked that these sets are disjoint by first running sort -c ... on each file, and then comm -1 -2 ... ... on every pairing of the files listed. No pair shares a symbol, so the yacc grammar is correct.
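The same pairwise check can be sketched in Python. This is only an illustration under stated assumptions: it assumes each attached file holds one symbol per line, the helper names load_symbols and find_overlaps are hypothetical, and the sample sets below are made-up stand-ins rather than the real contents of the attachments.

```python
from itertools import combinations


def load_symbols(path):
    """Read one symbol per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def find_overlaps(classes):
    """Return {(name_a, name_b): shared symbols} for every pair that intersects.

    An empty result means the symbol classes are pairwise disjoint.
    """
    overlaps = {}
    for name_a, name_b in combinations(sorted(classes), 2):
        shared = classes[name_a] & classes[name_b]
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Made-up stand-in data: "extra" deliberately overlaps two other classes.
classes = {
    "reserved": {"KW_A", "KW_B"},
    "unreserved": {"KW_C", "KW_D"},
    "col_name": {"KW_E"},
    "extra": {"KW_B", "KW_E", "KW_F"},
}

for pair, shared in find_overlaps(classes).items():
    print(pair, sorted(shared))
```

With real data, classes would be built by calling load_symbols on each attached file; unlike comm, the set intersection does not need the files to be sorted first.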

kaby76 commented 2 days ago

Over in the Antlr grammar, the symbol sets are: sym_col_name_keyword.txt sym_plsql_unreserved_keyword.txt sym_reserved_keyword.txt sym_type_func_name_keyword.txt sym_unreserved_keyword.txt

These sets are also disjoint, except for plsql_unreserved_keyword, which overlaps with several of the other sets.

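You should not combine non-disjoint sets as alternatives in Antlr: a symbol that belongs to more than one set can be matched by more than one alternative, which makes the grammar ambiguous. This can be rectified in several ways, but it is best to follow the yacc version, because yacc requires disjoint sets whereas Antlr does not enforce this.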
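To illustrate why overlapping sets lead to ambiguity, here is a deliberately simplified model in Python. It treats a rule as a plain choice among symbol sets and lists every alternative that can match a given token; each match corresponds to one possible parse tree. The function name matching_alternatives, the alternative names, and the set contents are all hypothetical, and the model ignores everything else a real parser does.

```python
def matching_alternatives(token, alternatives):
    """Return the names of all alternatives whose symbol set contains token.

    More than one name in the result means the token can be derived in
    more than one way, i.e. the rule is ambiguous for that token.
    """
    return [name for name, symbols in alternatives if token in symbols]


# Hypothetical rule with four alternatives; the contents are invented
# for illustration and are not taken from the grammar.
alternatives = [
    ("alt_1", {"FIRST", "KW_A"}),
    ("alt_2", {"FIRST", "KW_B"}),
    ("alt_3", {"KW_C"}),
    ("alt_4", {"FIRST", "KW_D"}),
]

print(matching_alternatives("FIRST", alternatives))  # three candidate derivations
print(matching_alternatives("KW_C", alternatives))   # exactly one derivation
```

When the sets are disjoint, every token yields at most one entry, which is the property the yacc symbol classes have.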

kaby76 commented 1 day ago

The postgresql grammar appears to have a mishmash of PlSQL productions embedded in the PostgreSQL rules. This is wrong. If you want to combine the two grammars, it should be done in another way, and certainly not as part of the official PostgreSQL grammar.

I am removing PlSQL productions.