Closed dlenski closed 8 years ago
you can use tokenlist._pprint_tree()
to get the tree printed.
TokenList
is mostly a generic class that gets sub-classed into somewhat "proper" expressions.
The particular cases you are pointing out are identifier
objects, and they have the habit of sometimes nesting an identifier
within an identifier
that only encapsulated a single token.
I been trying to figure that one out myself as to what the proper behavior should be, and when should it be and not be an identifier.
import sqlparse
sql = 'ALTER TABLE FOO.BAR ADD CONSTRAINT FOOBAR_PK PRIMARY KEY (ABC, XYZ, FNORD);'
p = sqlparse.parse(sql)[0]
p._pprint_tree()
Result:
| 0 DDL 'ALTER'
| 1 Whitespace ' '
| 2 Keyword 'TABLE'
| 3 Whitespace ' '
+-4 Identifier 'FOO.BAR'
| 0 Name 'FOO'
| 1 Punctuation '.'
| 2 Name 'BAR'
| 5 Whitespace ' '
| 6 Keyword 'ADD'
| 7 Whitespace ' '
| 8 Keyword 'CONSTR...'
| 9 Whitespace ' '
+-10 Identifier 'FOOBAR...'
| 0 Name 'FOOBAR...' # this one
| 11 Whitespace ' '
| 12 Keyword 'PRIMARY'
| 13 Whitespace ' '
| 14 Keyword 'KEY'
| 15 Whitespace ' '
+-16 Parenthesis '(ABC, ...'
| 0 Punctuation '('
+-1 IdentifierList 'ABC, X...'
+-0 Identifier 'ABC'
| 0 Name 'ABC' # this one
| 1 Punctuation ','
| 2 Whitespace ' '
+-3 Identifier 'XYZ'
| 0 Name 'XYZ' # this one
| 4 Punctuation ','
| 5 Whitespace ' '
+-6 Identifier 'FNORD'
| 0 Name 'FNORD' # this one
| 2 Punctuation ')'
| 17 Punctuation ';'
Thanks, @vmuriart! I wish I'd known about _pprint_tree
before rolling my own.
This actually makes a lot of sense… anywhere an identifier appears (whether a single Name or Name.Name.Name), it gets its own group.
I probably should've picked a different example, like the following one, where an Oracle DDL statement gets grouped in odd ways
sql = 'CREATE SEQUENCE FOO.BAR_SEQ START WITH 1000 MAXVALUE 100000 NOCYCLE;'
p = sqlparse.parse(sql)[0]
p._pprint_tree()
| 0 DDL 'CREATE'
| 1 Whitespace ' '
+-2 Identifier 'SEQUEN...'
| 0 Name 'SEQUEN...'
| 1 Whitespace ' '
| 2 Name 'FOO'
| 3 Punctuation '.'
+-4 Identifier '"BAR_S...'
| 0 Symbol '"BAR_S...'
| 5 Whitespace ' '
| 6 DML 'START'
| 7 Whitespace ' '
| 8 Keyword 'WITH'
| 9 Whitespace ' '
| 10 Integer '1000'
| 11 Whitespace ' '
| 12 Keyword 'MAXVAL...'
| 13 Whitespace ' '
+-14 Identifier '100000...' # this is strange
| 0 Integer '100000'
| 1 Whitespace ' '
| 2 Name 'NOCYCLE'
| 15 Punctuation ';'
It's probably impossible to get this 100% right without poring through Oracle's ridiculous syntax diagrams for every case… but the grouping of 100000 NOCYCLE
into an Identifier
seems quite unexpected.
Two of the most common pr
s for:
function
/name
Name
tokens. In your case NOCYCLE
is missing from the valid keywords dictionary.
Is SEQUENCE
in position p[2][0] correct, that looks like it should be a keyword. I imagine that FOO.BAR_SEQ
should have been grouped into an identifier instead.
The more interesting thing is how did BAR_SEQ
become "BAR_SEQ"
(quotation marks).
Gotcha, thanks.
SEQUENCE
almost certainly should be a keyword since it's part of the SQL standard and supported by sane databases like PostgreSQL.Name
s in those other databases, and might confuse the parser for non-Oracle databases.By the way, the Oracle docs on keywords and legal identifiers are… simply insane. There are some syntactically meaningful words like NOCYCLE
which not actually reserved words, and can be used in object names even though this "may lead to unpredictable results":
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm
3. The Oracle SQL language contains other words that have special meanings. These words include datatypes, schema names, function names, the dummy system table DUAL, and keywords (the uppercase words in SQL statements, such as DIMENSION, SEGMENT, ALLOCATE, DISABLE, and so forth). These words are not reserved. However, Oracle uses them internally in specific ways. Therefore, if you use these words as names for objects and object parts, then your SQL statements may be more difficult to read and may lead to unpredictable results.
The more interesting thing is how did BAR_SEQ become "BAR_SEQ" (quotation marks).
That's because I changed the command slightly before pasting the output.
Me ← (derp)
Actually, SEQUENCE
was already added in 1ad556f … but didn't make it into the 0.1.19 release (it was merged right after).
good ol' oracle. I'm just hoping noone requests to support this oracle 'feature' that makes this statement valid
select*from"DUAL";
It might be supported in other languages as well, but none the less; its a case I don't want to encounter.
Feel free to submit pr
s for any missing keywords and relevant tests if needed; especially before 0.2.0
gets finalized. Down the line we might add an option for dialect specific keyword dictionaries to allow/disallow keywords and names as needed.
Below is the tree after merging the changes
0 DDL 'CREATE'
1 Whitespace ' '
2 Keyword 'SEQUEN...'
3 Whitespace ' '
4 Identifier 'FOO.BA...'
| 0 Name 'FOO'
| 1 Punctuation '.'
| 2 Name 'BAR_SEQ'
5 Whitespace ' '
6 DML 'START'
7 Whitespace ' '
8 CTE 'WITH'
9 Whitespace ' '
10 Integer '1000'
11 Whitespace ' '
12 Keyword 'MAXVAL...'
13 Whitespace ' '
14 Integer '100000'
15 Whitespace ' '
16 Keyword 'NOCYCLE'
17 Punctuation ';'
8
shouldn't be CTE
but thats a different story.
That's definitely an improvement. I'm curious about why 100000 NOCYCLE
was previously grouped as an Identifier
.
Is there ever a case where an Integer
is actually used to start an Identifier
… ? Seems like maybe there should be a patch to prevent this even if sqlparse doesn't know what the adjacent words mean.
10000 is being classed as an integer while (back then) nocycle was a "name", sqlparse
was identifying it as being an alias. similar to:
select 1 aliasname from dual
where the column name will be aliasname
in otherword, because 100000
has an "alias" sqlparse grouped it into an identifier.
Well now that just… totally makes sense. I was thinking maybe you shouldn't make this kind of guess in a DDL statement, but then I realized that this would mess up cases like:
CREATE TABLE junk AS select 1 aliasname from dual
SQL is … >:-(
Take a DDL statement like this and
parsestream
it:This yields a single
sqlparse.sql.Statement
(subclass ofTokenList
). ThisTokenList
in turn contains further nestedTokenList
.Some of these make sense to me, as when
FOO.BAR
is grouped as a nestedTokenList
, but in several cases there is a nestedTokenList
that contains only a single token…What is the logic here? How can I predict when I'll get a nested TokenList, as opposed to a single token?
I'm using sqlparse v0.1.19 with Python 2.7, and the following recursive function to print the nested representation of a blob of SQL text: