andialbrecht / sqlparse

A non-validating SQL parser module for Python
BSD 3-Clause "New" or "Revised" License
3.74k stars 697 forks source link

can not parse add multi columns #467

Open callnothing opened 5 years ago

callnothing commented 5 years ago

this is my test sql

alter table test add column column1 datetime not null default '1000-01-01 00:00:00',add column column2 datetime not null default '1000-01-01 00:00:00';

and print all tokens

for token in t.tokens: print token.ttype, str(token.value)

Token.Keyword.DDL alter Token.Text.Whitespace Token.Keyword table Token.Text.Whitespace None test Token.Text.Whitespace Token.Keyword add Token.Text.Whitespace Token.Keyword column Token.Text.Whitespace None column1 datetime Token.Text.Whitespace Token.Keyword not null Token.Text.Whitespace Token.Keyword default Token.Text.Whitespace None '1000-01-01 00:00:00',add Token.Text.Whitespace Token.Keyword column Token.Text.Whitespace None column2 datetime Token.Text.Whitespace Token.Keyword not null Token.Text.Whitespace Token.Keyword default Token.Text.Whitespace Token.Literal.String.Single '1000-01-01 00:00:00' Token.Punctuation ;

the add keyword can not be parsed

iso13485 commented 5 years ago

I did encounter a similiar issue, when parsing a creation statement.

create_sql = '''CREATE TABLE countries (
    id INTEGER PRIMARY KEY,
    code VARCHAR(3) UNIQUE NOT NULL,
    name VARCHAR(126) UNIQUE NOT NULL
    );'''

parenthesis = [*(sqlparse.parse(create_sql )[0])][6]

WS_TOKENS=(sqlparse.tokens.Token.Text.Whitespace,sqlparse.tokens.Token.Text.Whitespace.Newline,sqlparse.tokens.Token.Comment)

for n,t in enumerate(parenthesis):
    if not t.ttype in WS_TOKENS : 
        print('{0:>2}: is_group {1},\tclass {2:>14},\tttype {3:>12}\t, parent {4}[{5}]\t, "{6}"'.format(
            n,                                   ##Token Num in Parenthesis
            t.is_group,                          ##Is TokenGroup (e.g. IdentifierList)
            str(type(t)).split('.')[-1][:-2],    ##Classname (last part)
            str(t.ttype)[6:],                    ##Token ttype (if not None)
            str(type(t.parent)).split('.')[-1],  ##Parent, either the Parenthesis OR (if flattended) an Identifier List
            id(t.parent),                        ##Parents ID
            str(t)))                             ##SQL Str

returns:

 0: is_group False, class          Token,   ttype  Punctuation  , parent Parenthesis'>[2257639606576]   , "("
 3: is_group True,  class     Identifier,   ttype               , parent Parenthesis'>[2257639606576]   , "id"
 5: is_group False, class          Token,   ttype Name.Builtin  , parent Parenthesis'>[2257639606576]   , "INTEGER"
 7: is_group False, class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   , "PRIMARY"
9: is_group True,   class IdentifierList,   ttype               , parent Parenthesis'>[2257639606576]   , "KEY,
    code"
11: is_group True,  class       Function,   ttype               , parent Parenthesis'>[2257639606576]   , "VARCHAR(3)"
13: is_group False, class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   , "UNIQUE"
15: is_group True,  class IdentifierList,   ttype               , parent Parenthesis'>[2257639606576]   , "NOT NULL,
    name"
17: is_group True,  class       Function,   ttype               , parent Parenthesis'>[2257639606576]   , "VARCHAR(126)"
19: is_group False, class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   , "UNIQUE"
21: is_group False, class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   , "NOT NULL"
24: is_group False, class          Token,   ttype  Punctuation  , parent Parenthesis'>[2257639606576]   , ")"

See Line 9 & 15.

Keyword, (Whitspace), Comma, (Whitespace) and Name of the next identifier are grouped into one IdentifierList.

Using print('{0:>2}: "{6:<14}" ,is_group {1},\tclass {2:>14},\tttype {3:>12}\t, parent {4}[{5}]\t'.format(... (token # : SQL, is_group,...) I do see, that the IdentifierList is grouped around the line-break.

 7: "PRIMARY       " ,is_group False,   class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   
 9: "KEY,
    code " ,is_group True,  class IdentifierList,   ttype               , parent Parenthesis'>[2257639606576]   
11: "VARCHAR(3)    " ,is_group True,    class       Function,   ttype               , parent Parenthesis'>[2257639606576]   

And

13: "UNIQUE        " ,is_group False,   class          Token,   ttype      Keyword  , parent Parenthesis'>[2257639606576]   
15: "NOT NULL,
    name" ,is_group True,   class IdentifierList,   ttype               , parent Parenthesis'>[2257639606576]   
17: "VARCHAR(126)  " ,is_group True,    class       Function,   ttype               , parent Parenthesis'>[2257639606576]   

A flattened view (SQL and Parent first) shows that more explicitly

for n,t in enumerate(parenthesis.flatten()):
    if not t.ttype in WS_TOKENS : 
        print('{0:>2}: "{6:<10}", parent {4}[{5}],\tclass {2:>5},\tttype {3:>12}\t\t"'.format(
            n,                                   ##Token # in Parenthesis
            ....
            str(t)))                             ##SQL Str

shows, that Keywords, followed by a Comma and an Name become grouped into an Identifier List

 0: "(         ", parent Parenthesis'>[2257639606576],  class Token,    ttype  Punctuation      
...     
 7: "PRIMARY   ", parent Parenthesis'>[2257639606576],  class Token,    ttype      Keyword      
 9: "KEY       ", parent IdentifierList'>[2257640149664],   class Token,    ttype      Keyword      
10: ",         ", parent IdentifierList'>[2257640149664],   class Token,    ttype  Punctuation      
13: "code      ", parent Identifier'>[2257640149304],   class Token,    ttype         Name      
15: "VARCHAR   ", parent Identifier'>[2257640150264],   class Token,    ttype         Name      
16: "(         ", parent Parenthesis'>[2257639606696],  class Token,    ttype  Punctuation      
17: "3         ", parent Parenthesis'>[2257639606696],  class Token,    ttype Literal.Number.Integer        
18: ")         ", parent Parenthesis'>[2257639606696],  class Token,    ttype  Punctuation      
20: "UNIQUE    ", parent Parenthesis'>[2257639606576],  class Token,    ttype      Keyword      
22: "NOT NULL  ", parent IdentifierList'>[2257640149184],   class Token,    ttype      Keyword      
23: ",         ", parent IdentifierList'>[2257640149184],   class Token,    ttype  Punctuation      
26: "name      ", parent Identifier'>[2257640149784],   class Token,    ttype         Name      
28: "VARCHAR   ", parent Identifier'>[2257640151104],   class Token,    ttype         Name      
29: "(         ", parent Parenthesis'>[2257639606096],  class Token,    ttype  Punctuation      
30: "126       ", parent Parenthesis'>[2257639606096],  class Token,    ttype Literal.Number.Integer        
31: ")         ", parent Parenthesis'>[2257639606096],  class Token,    ttype  Punctuation      
33: "UNIQUE    ", parent Parenthesis'>[2257639606576],  class Token,    ttype      Keyword      
... 
38: ")         ", parent Parenthesis'>[2257639606576],  class Token,    ttype  Punctuation  

I do believe the grouping around Commas does something strange, and I do believe this is the same as Bug #444. I'd assume this happens in engine/grouping.py when group_identifier_list (~line 330) is called as the last grouping function. But I don't understand the grouping yet, so I hope this makes at least a comprehensible description and a bump. [edit : typo in example]

This behaviour comes fromengine/grouping.py (currently line 224) :

  1. ',' is matched as a identifier-separator
  2. 'KEY' (this example) or 'DEFAULT' (first example) is matched through m_role as a valid previous token before ',' in an Identifier List.

    def group_identifier_list(tlist):
    m_role = T.Keyword, ('null', 'role')
    ...
    def match(token):
        return token.match(T.Punctuation, ',')
    
    def valid(token):
        return imt(token, i=sqlcls, m=m_role, t=ttypes)
    ...
    valid_prev = valid_next = valid
        _group(tlist, sql.IdentifierList, match,
               valid_prev, valid_next, post, extend=True)
    ...

    The included Keyword before the Comma, seems to be the issue during grouping for both cases. [edit: source, typo]