andialbrecht / sqlparse

A non-validating SQL parser module for Python
BSD 3-Clause "New" or "Revised" License
3.75k stars 699 forks source link

parsing parenthesis after keywords #615

Open IterableTrucks opened 3 years ago

IterableTrucks commented 3 years ago

Parsing a sql with parenthesis right after keyword without any whitespace characters (which is syntactically correct in SQL standard) will produce the belowing unexpected result:

sqlparse.parse('select * from child join(select * from parent) as p on child.parent_id=p.id;')[0].tokens

[<DML 'select' at 0x7F448E3FF980>, <Whitespace ' ' at 0x7F448E3FFA60>, <Wildcard '*' at 0x7F448E3FFAD0>, 
<Whitespace ' ' at 0x7F448E3FFB40>, <Keyword 'from' at 0x7F448E3FF9F0>, <Whitespace ' ' at 0x7F448E3FFC20>, 
<Identifier 'child ...' at 0x7F448E3FBA50>, <Whitespace ' ' at 0x7F448E4053D0>, <Keyword 'on' at 0x7F448E405440>, 
<Whitespace ' ' at 0x7F448E4054B0>, <Comparison 'child....' at 0x7F448E3FBC50>, <Punctuation ';' at 0x7F448E405830>]

Compared with the correct result adding a whitespace after JOIN:


sqlparse.parse('select * from child join (select * from parent) as p on child.parent_id=p.id;')[0].tokens

[<DML 'select' at 0x7F448E4123D0>, <Whitespace ' ' at 0x7F448E412440>, <Wildcard '*' at 0x7F448E4124B0>, 
<Whitespace ' ' at 0x7F448E412520>, <Keyword 'from' at 0x7F448E40FD70>, <Whitespace ' ' at 0x7F448E412600>, 
<Identifier 'child' at 0x7F448E40BED0>, <Whitespace ' ' at 0x7F448E4126E0>, <Keyword 'join' at 0x7F448E412590>, 
<Whitespace ' ' at 0x7F448E4127C0>, <Identifier '(selec...' at 0x7F448E40BD50>, <Whitespace ' ' at 0x7F448E412DE0>, 
<Keyword 'on' at 0x7F448E40F2F0>, <Whitespace ' ' at 0x7F448E40F360>, <Comparison 'child....' at 0x7F448E40BA50>, 
<Punctuation ';' at 0x7F448E40EDE0>]
froukees commented 2 years ago

I've noticed there were multiple issues already solving this issue for individual keywords (e.g. https://github.com/andialbrecht/sqlparse/issues/507, https://github.com/andialbrecht/sqlparse/issues/446), closed by extending this list of keywords: https://github.com/andialbrecht/sqlparse/blob/3bc7d93254cbef71bdef91905f1814201a1b1f02/sqlparse/keywords.py#L52

I found this issue while searching for a reason why Querybook, which uses sqlparse, is detecting UNNEST as a table name in e.g. Presto query ...CROSS JOIN UNNEST(t.id).... The reason is UNNEST is detected as tokens.Name instead of tokens.Keyword, because it's followed by parenthesis = exactly the same issue as it the mentioned closed issue.

So I think there are two options:

  1. Continue with extending the list, now with JOIN and UNNEST 🙏 keywords
  2. Approach this somehow generally
    • I was thinking about replacing (r'[A-ZÀ-Ü]\w*(?=\()', tokens.Name) with (r'[A-ZÀ-Ü]\w*(?=\()', is_keyword) here https://github.com/andialbrecht/sqlparse/blob/3bc7d93254cbef71bdef91905f1814201a1b1f02/sqlparse/keywords.py#L63 but then some tests were broken, because sqlparse.sql.Function wasn't detected correctly for keywords, that were handled as names before, and therefore calling get_parameters() method of sql.Function is failing/missing. Then it got more complicated with the grouping functions and I didn't find any good way how to fix it without breaking any unit test, yet. But seemed like correct idea in the beginning...

What do you think @andialbrecht ?

ElectronicRU commented 9 months ago

Adding to the list WHERE x = ANY(...), where ANY is a keyword, not a function.