andialbrecht / sqlparse

A non-validating SQL parser module for Python
BSD 3-Clause "New" or "Revised" License

Give indication about position of each token seen #82

Open ChaosEternal opened 11 years ago

ChaosEternal commented 11 years ago

My suggestion is to add methods like col and row to the Token class, which would make tasks like SQL conversion easier.

piranna commented 11 years ago

I think I don't understand... The column number can be fetched from the position of the identifier inside the IdentifiersList, and this number would change if you add or remove identifiers (though it's true that a property could be added to get it more easily). But row? What do you mean by "row"? The order of a statement inside a StatementsList?

ChaosEternal commented 11 years ago

Sorry, let me explain it.

The column number is the position of the token from the beginning of its line. The row number is the line number of the token from the beginning of the file.

Think about thousands of SQL queries, each with thousands of lines, and a tool that needs to rewrite some parts of the queries while leaving other parts intact.
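
For illustration only, not part of sqlparse: if a token carried an absolute character offset, row and column could be derived from it and the original text, e.g.:

# Minimal sketch, not sqlparse API: derive 1-based (row, col) from an
# absolute character offset into the original SQL text.
def row_col(sql_text, offset):
    before = sql_text[:offset]
    row = before.count('\n') + 1                  # lines before the offset
    col = offset - (before.rfind('\n') + 1) + 1   # distance from last newline
    return row, col

sql = "SELECT *\nFROM some_table\nWHERE id = 1"
print(row_col(sql, sql.index("WHERE")))           # -> (3, 1)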

piranna commented 11 years ago

I still don't get the point of the rows and columns. As I told you, these numbers can change, and sqlparse also allows changing specific fragments of the SQL queries (and in fact it's better to do it in a relative form). Maybe the functionality you are requesting belongs in a higher-level library?

@andialbrecht, could you shed some light on this topic? :-)

ChaosEternal commented 11 years ago

Sorry for the disturbance. I wrote some quick-and-dirty code to explain this; see the attachment.

The 2.sql file is as follows:

WITH ATBL AS (SELECT

Running

sed -e "$(python convwith2temptable2.py 2.sql | sed 's/$/;/')" 2.sql

will give us this:

WITH DROP TABLE IF EXISTS ATBL; CREATE TABLE ATBL AS (SELECT

Thanks.


ChaosEternal commented 11 years ago

As you mentioned, the row number and column number will change, so the code is unsafe.

andialbrecht commented 11 years ago

I could think of another use case: In CrunchyFrog, the db frontend/SQL editor where sqlparse originated from, I need to know where a statement starts and ends to highlight it in the GUI. So col/row attributes on tokens would be helpful here too. Also when jumping to specific parts within a statement ("go to WHERE clause/JOIN definitions" would be a nice feature in the editor).

But as @piranna already noted, the values in the hypothetical cols/rows attributes would only be valid after initial parsing. During initial parsing all whitespace is preserved, so the lexer should be able to calculate the initial positions (I'd throw char offset into the ring then, too).

But as soon as filters are applied those properties could become invalid, and we have no easy way to tell whether they're still valid. OTOH, maybe that's just a documentation / naming issue...
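
To illustrate the kind of lookup such attributes would enable, here is a rough sketch that sums the lengths of preceding tokens with today's API; it only makes sense on the freshly parsed, whitespace-preserving tree:

import sqlparse
from sqlparse.sql import Where

def clause_span(stmt, clause_cls=Where):
    # Character span of the first child of the given class, computed by
    # summing the lengths of everything that precedes it in the statement.
    offset = 0
    for tok in stmt.tokens:
        if isinstance(tok, clause_cls):
            return offset, offset + len(str(tok))
        offset += len(str(tok))
    return None

sql = "SELECT a, b FROM t WHERE a = 1 ORDER BY b"
stmt = sqlparse.parse(sql)[0]
print(clause_span(stmt))   # (start, end) offsets of the WHERE clause within sql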

andialbrecht commented 9 years ago

See also #102

idank commented 8 years ago

Throwing another comment here to say I'd like to see this implemented, for a tool I'm working on. It seems pretty basic to have position information in a parser. :)

piranna commented 8 years ago

Pull-requests are welcome :-)


vmuriart commented 8 years ago

@idank do you mean absolute position? The nth token from the beginning of the statement?

idank commented 8 years ago

I meant absolute position. I ended up with this hack, for those interested:

def calcpositions(root, index=0):
    positions = {}

    if root.is_group():
        startindex = index
        for token in root.tokens:
            nestedpos, index = calcpositions(token, index)
            positions.update(nestedpos)
        positions[root] = (startindex, index)
    else:
        positions[root] = (index, index+len(root.value))
        index += len(root.value)

    return positions, index

It returns a {token: (start, end)} dictionary and should be called with sqlparse.parse(text)[0].
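
A usage sketch for the function above (it assumes the sqlparse API of that time, where is_group was a method; in later releases it became an attribute, so the parentheses in root.is_group() would have to go):

import sqlparse

sql = "SELECT name FROM users WHERE id = 42"
stmt = sqlparse.parse(sql)[0]
positions, total = calcpositions(stmt)

for token, (start, end) in positions.items():
    # every span indexes back into the original text, including group tokens
    assert sql[start:end] == str(token)
print(total)   # length of the whole statement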

vmuriart commented 8 years ago

It's along the lines of what I was going to suggest, though I had misunderstood and thought you meant the index as the position of the "whole" token rather than its start and end characters.

I modified my would-be suggestion, but it's essentially the same, except that mine skips the groups and just does the individual tokens.

from __future__ import print_function
import sqlparse

def calcpositions(root, index=0):
    positions = {}

    if root.is_group():
        startindex = index
        for token in root.tokens:
            nestedpos, index = calcpositions(token, index)
            positions.update(nestedpos)
        positions[root] = (startindex, index)
    else:
        positions[root] = (index, index + len(root.value))
        index += len(root.value)

    return positions, index

def calcpositions2(stmt):
    index = 0
    positions = {}
    for token in stmt.flatten():
        positions[token] = (index, index + len(str(token)))
        index += len(str(token))
    return positions

if __name__ == '__main__':
    sql = "CREATE TABLE test();"
    stmt = sqlparse.parse(sql)[0]

    print(calcpositions(stmt))
    print("##############")
    print(calcpositions2(stmt))

({<Identifier 'test' at 0x1E9BC10>: (13, 17), <DDL 'CREATE' at 0x1EA0458>: (0, 6), <Whitespace ' ' at 0x1EA04A8>: (6, 7), <Keyword 'TABLE' at 0x1EA04F8>: (7, 12), <Whitespace ' ' at 0x1EA0548>: (12, 13), <Name 'test' at 0x1EA0598>: (13, 17), <Punctuation '(' at 0x1EA05E8>: (17, 18), <Punctuation ')' at 0x1EA0638>: (18, 19), <Punctuation ';' at 0x1EA0688>: (19, 20), <Parenthesis '()' at 0x1E9BB40>: (17, 19), <Statement 'CREATE...' at 0x1E9BBA8>: (0, 20)}, 20)
##############
{<Punctuation ')' at 0x1EA0638>: (18, 19), <DDL 'CREATE' at 0x1EA0458>: (0, 6), <Punctuation ';' at 0x1EA0688>: (19, 20), <Whitespace ' ' at 0x1EA04A8>: (6, 7), <Keyword 'TABLE' at 0x1EA04F8>: (7, 12), <Whitespace ' ' at 0x1EA0548>: (12, 13), <Name 'test' at 0x1EA0598>: (13, 17), <Punctuation '(' at 0x1EA05E8>: (17, 18)}

vmuriart commented 8 years ago

Though it nicely complements the token_at_offset function.

idank commented 8 years ago

Yeah, I happen to need the position of groups too.

This whole thing falls on its face if the parser decides to suppress parts of the input one day, say whitespace (like most parsers do). It works for now though.

andialbrecht commented 8 years ago

The parser should never suppress whitespace. IOW, during parsing it should be possible to restore the original input. This is required in some use cases.

sjoerdjob commented 8 years ago

The funny thing is that the position is already known in Lexer.py; this information just gets ignored again. Why not add it to the token instead?

Of course, calculating row & column might make it a bit harder, but it should be reasonably trivial to add in the lexer.
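
As a rough illustration of the point (a toy tokenizer, not the actual sqlparse lexer): a regex-driven lexer already knows each match's offset and could simply yield it along with the token:

import re

TOKEN_SPEC = [  # toy rules, for illustration only
    (re.compile(r'\s+'), 'Whitespace'),
    (re.compile(r'[A-Za-z_][A-Za-z0-9_]*'), 'Name'),
    (re.compile(r'\d+'), 'Number'),
    (re.compile(r'[^\s]'), 'Punctuation'),
]

def tokenize_with_pos(text):
    pos = 0
    while pos < len(text):
        for regex, ttype in TOKEN_SPEC:
            m = regex.match(text, pos)
            if m:
                yield ttype, m.group(), pos      # the offset is free here
                pos = m.end()
                break

for tok in tokenize_with_pos("SELECT 1"):
    print(tok)   # ('Name', 'SELECT', 0), ('Whitespace', ' ', 6), ('Number', '1', 7)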

vmuriart commented 8 years ago

The problem was that once filters are applied to the lexer, the rows/columns are modified. At the moment parse doesn't take any filters that would make this a problem; but if it were allowed to take filters (I was planning on adding that capability later), the positions would shift.

Thinking about it though, it shouldn't be too complicated to add a component to the filters to update this information before processing finishes.

sjoerdjob commented 8 years ago

It depends on the information you want. If what you want is (most likely) the position where the token originally occurred in the stream/file/string, then doing it in the lexer makes sense.

In reality, sqlparse can also be used for reformatting and other position-changing purposes. In that case, yes, the whole concept of position breaks down in general. Still, I think adding the functionality with that disclaimer makes sense.
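
A small demonstration of that disclaimer, using a helper similar in spirit to calcpositions2 above: the offsets only index into the exact text the tree was parsed from, so after reformatting they have to be recomputed against the new string:

import sqlparse

def offset_of(text, word):
    # character offset of the first flattened token whose value matches word
    index = 0
    for token in sqlparse.parse(text)[0].flatten():
        if str(token).upper() == word.upper():
            return index
        index += len(str(token))
    return None

original = "select  a, b\nfrom   t where a = 1"
reformatted = sqlparse.format(original, reindent=True, keyword_case='upper')

print(offset_of(original, 'from'))      # offset of FROM in the original text
print(offset_of(reformatted, 'from'))   # generally a different offset afterwards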

lelit commented 7 years ago

I second the need to have (back) a way to get the position of each token. In sqlparse > 0.2 the Lexer class has been simplified (it is currently a mere namespace for a single classmethod) and the tokenize() function that uses it does not return the position anymore.

One of my tools implements poor man's fake domains on databases that lack the domain concept: it recognizes CREATE DOMAIN foo bar and mechanically replaces foo with bar in the statements where it applies. I had to reimplement the old behaviour to be able to do that.
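
For what it's worth, here is one guess at the kind of workaround described above: since tokenize() still yields every piece of the input, whitespace included, the offsets can be re-accumulated on top of it:

from sqlparse import lexer

def tokenize_with_positions(sql):
    # wrap sqlparse.lexer.tokenize() and track the running character offset
    offset = 0
    for ttype, value in lexer.tokenize(sql):
        yield offset, ttype, value
        offset += len(value)

for offset, ttype, value in tokenize_with_positions("CREATE DOMAIN foo bar"):
    print(offset, ttype, repr(value))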

muhmud commented 3 years ago

I also need this for something I'm working on!

IMO, it makes complete sense to provide this for parse, and then simply let the information be invalidated as and when that happens.