Open callnothing opened 5 years ago
I did encounter a similiar issue, when parsing a creation statement.
create_sql = '''CREATE TABLE countries (
id INTEGER PRIMARY KEY,
code VARCHAR(3) UNIQUE NOT NULL,
name VARCHAR(126) UNIQUE NOT NULL
);'''
parenthesis = [*(sqlparse.parse(create_sql )[0])][6]
WS_TOKENS=(sqlparse.tokens.Token.Text.Whitespace,sqlparse.tokens.Token.Text.Whitespace.Newline,sqlparse.tokens.Token.Comment)
for n,t in enumerate(parenthesis):
if not t.ttype in WS_TOKENS :
print('{0:>2}: is_group {1},\tclass {2:>14},\tttype {3:>12}\t, parent {4}[{5}]\t, "{6}"'.format(
n, ##Token Num in Parenthesis
t.is_group, ##Is TokenGroup (e.g. IdentifierList)
str(type(t)).split('.')[-1][:-2], ##Classname (last part)
str(t.ttype)[6:], ##Token ttype (if not None)
str(type(t.parent)).split('.')[-1], ##Parent, either the Parenthesis OR (if flattended) an Identifier List
id(t.parent), ##Parents ID
str(t))) ##SQL Str
returns:
0: is_group False, class Token, ttype Punctuation , parent Parenthesis'>[2257639606576] , "("
3: is_group True, class Identifier, ttype , parent Parenthesis'>[2257639606576] , "id"
5: is_group False, class Token, ttype Name.Builtin , parent Parenthesis'>[2257639606576] , "INTEGER"
7: is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576] , "PRIMARY"
9: is_group True, class IdentifierList, ttype , parent Parenthesis'>[2257639606576] , "KEY,
code"
11: is_group True, class Function, ttype , parent Parenthesis'>[2257639606576] , "VARCHAR(3)"
13: is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576] , "UNIQUE"
15: is_group True, class IdentifierList, ttype , parent Parenthesis'>[2257639606576] , "NOT NULL,
name"
17: is_group True, class Function, ttype , parent Parenthesis'>[2257639606576] , "VARCHAR(126)"
19: is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576] , "UNIQUE"
21: is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576] , "NOT NULL"
24: is_group False, class Token, ttype Punctuation , parent Parenthesis'>[2257639606576] , ")"
See Line 9 & 15.
Keyword, (Whitspace), Comma, (Whitespace) and Name of the next identifier are grouped into one IdentifierList.
Using print('{0:>2}: "{6:<14}" ,is_group {1},\tclass {2:>14},\tttype {3:>12}\t, parent {4}[{5}]\t'.format(...
(token # : SQL, is_group,...) I do see, that the IdentifierList is grouped around the line-break.
7: "PRIMARY " ,is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576]
9: "KEY,
code " ,is_group True, class IdentifierList, ttype , parent Parenthesis'>[2257639606576]
11: "VARCHAR(3) " ,is_group True, class Function, ttype , parent Parenthesis'>[2257639606576]
And
13: "UNIQUE " ,is_group False, class Token, ttype Keyword , parent Parenthesis'>[2257639606576]
15: "NOT NULL,
name" ,is_group True, class IdentifierList, ttype , parent Parenthesis'>[2257639606576]
17: "VARCHAR(126) " ,is_group True, class Function, ttype , parent Parenthesis'>[2257639606576]
A flattened view (SQL and Parent first) shows that more explicitly
for n,t in enumerate(parenthesis.flatten()):
if not t.ttype in WS_TOKENS :
print('{0:>2}: "{6:<10}", parent {4}[{5}],\tclass {2:>5},\tttype {3:>12}\t\t"'.format(
n, ##Token # in Parenthesis
....
str(t))) ##SQL Str
shows, that Keywords, followed by a Comma and an Name become grouped into an Identifier List
0: "( ", parent Parenthesis'>[2257639606576], class Token, ttype Punctuation
...
7: "PRIMARY ", parent Parenthesis'>[2257639606576], class Token, ttype Keyword
9: "KEY ", parent IdentifierList'>[2257640149664], class Token, ttype Keyword
10: ", ", parent IdentifierList'>[2257640149664], class Token, ttype Punctuation
13: "code ", parent Identifier'>[2257640149304], class Token, ttype Name
15: "VARCHAR ", parent Identifier'>[2257640150264], class Token, ttype Name
16: "( ", parent Parenthesis'>[2257639606696], class Token, ttype Punctuation
17: "3 ", parent Parenthesis'>[2257639606696], class Token, ttype Literal.Number.Integer
18: ") ", parent Parenthesis'>[2257639606696], class Token, ttype Punctuation
20: "UNIQUE ", parent Parenthesis'>[2257639606576], class Token, ttype Keyword
22: "NOT NULL ", parent IdentifierList'>[2257640149184], class Token, ttype Keyword
23: ", ", parent IdentifierList'>[2257640149184], class Token, ttype Punctuation
26: "name ", parent Identifier'>[2257640149784], class Token, ttype Name
28: "VARCHAR ", parent Identifier'>[2257640151104], class Token, ttype Name
29: "( ", parent Parenthesis'>[2257639606096], class Token, ttype Punctuation
30: "126 ", parent Parenthesis'>[2257639606096], class Token, ttype Literal.Number.Integer
31: ") ", parent Parenthesis'>[2257639606096], class Token, ttype Punctuation
33: "UNIQUE ", parent Parenthesis'>[2257639606576], class Token, ttype Keyword
...
38: ") ", parent Parenthesis'>[2257639606576], class Token, ttype Punctuation
I do believe the grouping around Commas does something strange, and I do believe this is the same as Bug #444. I'd assume this happens in engine/grouping.py
when group_identifier_list
(~line 330) is called as the last grouping function. But I don't understand the grouping yet, so I hope this makes at least a comprehensible description and a bump.
[edit : typo in example]
This behaviour comes fromengine/grouping.py
(currently line 224) :
','
is matched as a identifier-separator'KEY'
(this example) or 'DEFAULT'
(first example) is matched through m_role
as a valid previous token before ','
in an Identifier List.
def group_identifier_list(tlist):
m_role = T.Keyword, ('null', 'role')
...
def match(token):
return token.match(T.Punctuation, ',')
def valid(token):
return imt(token, i=sqlcls, m=m_role, t=ttypes)
...
valid_prev = valid_next = valid
_group(tlist, sql.IdentifierList, match,
valid_prev, valid_next, post, extend=True)
...
The included Keyword before the Comma, seems to be the issue during grouping for both cases. [edit: source, typo]
this is my test sql
alter table test add column column1 datetime not null default '1000-01-01 00:00:00',add column column2 datetime not null default '1000-01-01 00:00:00';
and print all tokens
for token in t.tokens: print token.ttype, str(token.value)
Token.Keyword.DDL alter Token.Text.Whitespace Token.Keyword table Token.Text.Whitespace None test Token.Text.Whitespace Token.Keyword add Token.Text.Whitespace Token.Keyword column Token.Text.Whitespace None column1 datetime Token.Text.Whitespace Token.Keyword not null Token.Text.Whitespace Token.Keyword default Token.Text.Whitespace None '1000-01-01 00:00:00',add Token.Text.Whitespace Token.Keyword column Token.Text.Whitespace None column2 datetime Token.Text.Whitespace Token.Keyword not null Token.Text.Whitespace Token.Keyword default Token.Text.Whitespace Token.Literal.String.Single '1000-01-01 00:00:00' Token.Punctuation ;
the add keyword can not be parsed