andialbrecht / sqlparse

A non-validating SQL parser module for Python
BSD 3-Clause "New" or "Revised" License
3.72k stars 694 forks source link

Why does parsestream() nest some tokens with TokenList of length 1? #266

Closed dlenski closed 8 years ago

dlenski commented 8 years ago

Take a DDL statement like this and parsestream it:

ALTER TABLE FOO.BAR ADD CONSTRAINT FOOBAR_PK PRIMARY KEY (ABC, XYZ, FNORD);

This yields a single sqlparse.sql.Statement (subclass of TokenList). This TokenList in turn contains further nested TokenList.

Some of these make sense to me, as when FOO.BAR is grouped as a nested TokenList, but in several cases there is a nested TokenList that contains only a single token…

Token.Text.Whitespace with value=u' '
Token.Keyword.DDL with value=u'ALTER'
Token.Text.Whitespace with value=u' '
Token.Keyword with value=u'TABLE'
Token.Text.Whitespace with value=u' '
TokenList with value=u'FOO.BAR'
    Token.Name with value=u'FOO'
    Token.Punctuation with value=u'.'
    Token.Name with value=u'BAR'
Token.Text.Whitespace with value=u' '
Token.Keyword with value=u'ADD'
Token.Text.Whitespace with value=u' '
Token.Keyword with value=u'CONSTRAINT'
Token.Text.Whitespace with value=u' '
TokenList with value=u'FOOBAR_PK'
    Token.Name with value=u'FOOBAR_PK'
    # why was there a TokenList containing only 1 token here?
Token.Text.Whitespace with value=u' '
Token.Keyword with value=u'PRIMARY'
Token.Text.Whitespace with value=u' '
Token.Keyword with value=u'KEY'
Token.Text.Whitespace with value=u' '
TokenList with value=u'(ABC, XYZ, FNORD)'
    Token.Punctuation with value=u'('
    TokenList with value=u'ABC, XYZ, FNORD'
        TokenList with value=u'ABC'
            Token.Name with value=u'ABC'
            # why was there a TokenList containing only 1 token here?
        Token.Punctuation with value=u','
        Token.Text.Whitespace with value=u' '
        TokenList with value=u'XYZ'
            Token.Name with value=u'XYZ'
            # why was there a TokenList containing only 1 token here?
        Token.Punctuation with value=u','
        Token.Text.Whitespace with value=u' '
        TokenList with value=u'FNORD'
            Token.Name with value=u'FNORD'
            # why was there a TokenList containing only 1 token here?
    Token.Punctuation with value=u')'
Token.Punctuation with value=u';'
Token.Text.Whitespace with value=u' '

What is the logic here? How can I predict when I'll get a nested TokenList, as opposed to a single token?

I'm using sqlparse v0.1.19 with Python 2.7, and the following recursive function to print the nested representation of a blob of SQL text:

from itertools import chain

def token_dive(sql, indent=0):
    if isinstance(sql, basestring):
        tokens = chain.from_iterable(stmt.tokens for stmt in sqlparse.parsestream(sql))
    else:
        tokens = sql.tokens

    for ii,token in enumerate(tokens):
        #print token, token.ttype
        if isinstance(token, sqlparse.sql.TokenList):
            print('{}TokenList with value={}'.format(' '*(4*indent), repr(token.value)))
            token_dive(token, indent+1)
        else:
            print('{}{} with value={}'.format(' '*(4*indent), token.ttype, repr(token.value)))
    else:
        if ii==0 and indent!=0:
            print('{}# why was there a TokenList containing only 1 token here?'.format(' '*(4*indent)))
vmuriart commented 8 years ago

you can use tokenlist._pprint_tree() to get the tree printed. TokenList is mostly a generic class that gets sub-classed into somewhat "proper" expressions. The particular cases you are pointing out are identifier objects, and they have the habit of sometimes nesting an identifier within an identifier that only encapsulated a single token.

I been trying to figure that one out myself as to what the proper behavior should be, and when should it be and not be an identifier.

import sqlparse

sql = 'ALTER TABLE FOO.BAR ADD CONSTRAINT FOOBAR_PK PRIMARY KEY (ABC, XYZ, FNORD);'
p = sqlparse.parse(sql)[0]
p._pprint_tree()

Result:


 | 0 DDL 'ALTER'
 | 1 Whitespace ' '
 | 2 Keyword 'TABLE'
 | 3 Whitespace ' '
 +-4 Identifier 'FOO.BAR'
   | 0 Name 'FOO'
   | 1 Punctuation '.'
   | 2 Name 'BAR'
 | 5 Whitespace ' '
 | 6 Keyword 'ADD'
 | 7 Whitespace ' '
 | 8 Keyword 'CONSTR...'
 | 9 Whitespace ' '
 +-10 Identifier 'FOOBAR...'
   | 0 Name 'FOOBAR...'  # this one 
 | 11 Whitespace ' '
 | 12 Keyword 'PRIMARY'
 | 13 Whitespace ' '
 | 14 Keyword 'KEY'
 | 15 Whitespace ' '
 +-16 Parenthesis '(ABC, ...'
   | 0 Punctuation '('
   +-1 IdentifierList 'ABC, X...'
     +-0 Identifier 'ABC'
       | 0 Name 'ABC'  # this one 
     | 1 Punctuation ','
     | 2 Whitespace ' '
     +-3 Identifier 'XYZ'
       | 0 Name 'XYZ'  # this one 
     | 4 Punctuation ','
     | 5 Whitespace ' '
     +-6 Identifier 'FNORD'
       | 0 Name 'FNORD'  # this one 
   | 2 Punctuation ')'
 | 17 Punctuation ';'
dlenski commented 8 years ago

Thanks, @vmuriart! I wish I'd known about _pprint_tree before rolling my own.

This actually makes a lot of sense… anywhere an identifier appears (whether a single Name or Name.Name.Name), it gets its own group.

I probably should've picked a different example, like the following one, where an Oracle DDL statement gets grouped in odd ways

sql = 'CREATE SEQUENCE FOO.BAR_SEQ START WITH 1000 MAXVALUE 100000 NOCYCLE;'
p = sqlparse.parse(sql)[0]
p._pprint_tree()
 | 0 DDL 'CREATE'
 | 1 Whitespace ' '
 +-2 Identifier 'SEQUEN...'
   | 0 Name 'SEQUEN...'
   | 1 Whitespace ' '
   | 2 Name 'FOO'
 | 3 Punctuation '.'
 +-4 Identifier '"BAR_S...'
   | 0 Symbol '"BAR_S...'
 | 5 Whitespace ' '
 | 6 DML 'START'
 | 7 Whitespace ' '
 | 8 Keyword 'WITH'
 | 9 Whitespace ' '
 | 10 Integer '1000'
 | 11 Whitespace ' '
 | 12 Keyword 'MAXVAL...'
 | 13 Whitespace ' '
 +-14 Identifier '100000...'  # this is strange
   | 0 Integer '100000'
   | 1 Whitespace ' '
   | 2 Name 'NOCYCLE'
 | 15 Punctuation ';'

It's probably impossible to get this 100% right without poring through Oracle's ridiculous syntax diagrams for every case… but the grouping of 100000 NOCYCLE into an Identifier seems quite unexpected.

vmuriart commented 8 years ago

Two of the most common prs for:

  1. Paranthesis after a keyword causing it be to identified as a function/name
  2. Missing keyword in the keyword dictionary causing the lexer to make them Name tokens.

In your case NOCYCLE is missing from the valid keywords dictionary.

Is SEQUENCE in position p[2][0] correct, that looks like it should be a keyword. I imagine that FOO.BAR_SEQ should have been grouped into an identifier instead.

The more interesting thing is how did BAR_SEQ become "BAR_SEQ" (quotation marks).

dlenski commented 8 years ago

Gotcha, thanks.

By the way, the Oracle docs on keywords and legal identifiers are… simply insane. There are some syntactically meaningful words like NOCYCLE which not actually reserved words, and can be used in object names even though this "may lead to unpredictable results":

https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm

3. The Oracle SQL language contains other words that have special meanings. These words include datatypes, schema names, function names, the dummy system table DUAL, and keywords (the uppercase words in SQL statements, such as DIMENSION, SEGMENT, ALLOCATE, DISABLE, and so forth). These words are not reserved. However, Oracle uses them internally in specific ways. Therefore, if you use these words as names for objects and object parts, then your SQL statements may be more difficult to read and may lead to unpredictable results.

The more interesting thing is how did BAR_SEQ become "BAR_SEQ" (quotation marks).

That's because I changed the command slightly before pasting the output.

Me ← (derp)

dlenski commented 8 years ago

Actually, SEQUENCE was already added in 1ad556f … but didn't make it into the 0.1.19 release (it was merged right after).

screenshot from 2016-06-25 18-26-02

vmuriart commented 8 years ago

good ol' oracle. I'm just hoping noone requests to support this oracle 'feature' that makes this statement valid

select*from"DUAL";

It might be supported in other languages as well, but none the less; its a case I don't want to encounter.

Feel free to submit prs for any missing keywords and relevant tests if needed; especially before 0.2.0 gets finalized. Down the line we might add an option for dialect specific keyword dictionaries to allow/disallow keywords and names as needed.

vmuriart commented 8 years ago

Below is the tree after merging the changes

 0 DDL 'CREATE'
 1 Whitespace ' '
 2 Keyword 'SEQUEN...'
 3 Whitespace ' '
 4 Identifier 'FOO.BA...'
 |  0 Name 'FOO'
 |  1 Punctuation '.'
 |  2 Name 'BAR_SEQ'
 5 Whitespace ' '
 6 DML 'START'
 7 Whitespace ' '
 8 CTE 'WITH'
 9 Whitespace ' '
10 Integer '1000'
11 Whitespace ' '
12 Keyword 'MAXVAL...'
13 Whitespace ' '
14 Integer '100000'
15 Whitespace ' '
16 Keyword 'NOCYCLE'
17 Punctuation ';'

8 shouldn't be CTE but thats a different story.

dlenski commented 8 years ago

That's definitely an improvement. I'm curious about why 100000 NOCYCLE was previously grouped as an Identifier.

Is there ever a case where an Integer is actually used to start an Identifier … ? Seems like maybe there should be a patch to prevent this even if sqlparse doesn't know what the adjacent words mean.

vmuriart commented 8 years ago

10000 is being classed as an integer while (back then) nocycle was a "name", sqlparse was identifying it as being an alias. similar to:

    select 1 aliasname from dual

where the column name will be aliasname

vmuriart commented 8 years ago

in otherword, because 100000 has an "alias" sqlparse grouped it into an identifier.

dlenski commented 8 years ago

Well now that just… totally makes sense. I was thinking maybe you shouldn't make this kind of guess in a DDL statement, but then I realized that this would mess up cases like:

CREATE TABLE junk AS select 1 aliasname from dual

SQL is … >:-(