andialbrecht / sqlparse

A non-validating SQL parser module for Python
BSD 3-Clause "New" or "Revised" License
3.71k stars 693 forks source link

Configurable syntax #693

Closed mrmasterplan closed 1 year ago

mrmasterplan commented 1 year ago

This PR makes the Lexer a singleton class. This object carries the configured syntax as instance attributes. A library user who has non-standard syntax requirements is able to adapt the behavior of the Lexer to meet her needs. As an example for how to do this, please see the relevant test:

def test_configurable_syntax():
    sql = """select * from foo BACON SPAM EGGS;"""
    tokens = sqlparse.parse(sql)[0]

    assert list(
        (t.ttype, t.value) for t in tokens if t.ttype not in sqlparse.tokens.Whitespace
    ) == [
        (sqlparse.tokens.Keyword.DML, "select"),
        (sqlparse.tokens.Wildcard, "*"),
        (sqlparse.tokens.Keyword, "from"),
        (None, "foo BACON"),
        (None, "SPAM EGGS"),
        (sqlparse.tokens.Punctuation, ";"),
    ]

    Lexer().add_keywords(
        {
            "BACON": sqlparse.tokens.Name.Builtin,
            "SPAM": sqlparse.tokens.Keyword,
            "EGGS": sqlparse.tokens.Keyword,
        }
    )

    tokens = sqlparse.parse(sql)[0]

    assert list(
        (t.ttype, t.value) for t in tokens if t.ttype not in sqlparse.tokens.Whitespace
    ) == [
        (sqlparse.tokens.Keyword.DML, "select"),
        (sqlparse.tokens.Wildcard, "*"),
        (sqlparse.tokens.Keyword, "from"),
        (None, "foo"),
        (sqlparse.tokens.Name.Builtin, "BACON"),
        (sqlparse.tokens.Keyword, "SPAM"),
        (sqlparse.tokens.Keyword, "EGGS"),
        (sqlparse.tokens.Punctuation, ";"),
    ]
    # reset the syntax for later tests.
    Lexer().default_initialization()
mrmasterplan commented 1 year ago

hi @sjrusso8, check out this PR, I think this would future-proof us against further missing Databricks Spark SQL syntax.

codecov[bot] commented 1 year ago

Codecov Report

Base: 96.90% // Head: 96.95% // Increases project coverage by +0.04% :tada:

Coverage data is based on head (eb916a6) compared to base (8b789f2). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #693 +/- ## ========================================== + Coverage 96.90% 96.95% +0.04% ========================================== Files 20 20 Lines 1520 1544 +24 ========================================== + Hits 1473 1497 +24 Misses 47 47 ``` | [Impacted Files](https://codecov.io/gh/andialbrecht/sqlparse/pull/693?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Andi+Albrecht) | Coverage Δ | | |---|---|---| | [sqlparse/keywords.py](https://codecov.io/gh/andialbrecht/sqlparse/pull/693/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Andi+Albrecht#diff-c3FscGFyc2Uva2V5d29yZHMucHk=) | `100.00% <100.00%> (ø)` | | | [sqlparse/lexer.py](https://codecov.io/gh/andialbrecht/sqlparse/pull/693/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Andi+Albrecht#diff-c3FscGFyc2UvbGV4ZXIucHk=) | `92.18% <100.00%> (+6.47%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Andi+Albrecht). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Andi+Albrecht)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

sjrusso8 commented 1 year ago

@mrmasterplan This is a really interesting addition! Makes extending the keywords way easier for one-off/niche SQL favors.

Would we need to consider how to add in custom SQL regex values into SQL_REGEX ? So specific phrases like 'ZORDER BY' can be captured?

mrmasterplan commented 1 year ago

@sjrusso8 that was exactly the type of usecase I had in mind with the way the Lexer is configured. That is why I kept the file keywords.py mostly unchanged. You can easily design your own Lexer grammar like this:

import re

import sqlparse
from sqlparse import keywords
from sqlparse.lexer import Lexer

lex = Lexer()
lex.clear()

my_regex = (
    re.compile(r"ZORDER\s+BY\b", keywords.FLAGS).match,
    sqlparse.tokens.Keyword,
)

lex.set_SQL_REGEX(keywords.SQL_REGEX[:38] + [my_regex] + keywords.SQL_REGEX[38:])
lex.add_keywords(keywords.KEYWORDS_COMMON)
lex.add_keywords(keywords.KEYWORDS_ORACLE)
lex.add_keywords(keywords.KEYWORDS_PLPGSQL)
lex.add_keywords(keywords.KEYWORDS_HQL)
lex.add_keywords(keywords.KEYWORDS_MSACCESS)
lex.add_keywords(keywords.KEYWORDS)

sqlparse.parse("select * from foo zorder by bar;")

in fact, I just added a test to that effect

mrmasterplan commented 1 year ago

I don't know the documentation build system used in this repo? Can anyone point me in the right direction to learn it? Then I will add documentation about this configurability feature.

andialbrecht commented 1 year ago

@mrmasterplan I just skimmed through the pr and it looks really promising! Thanks a lot. The documentation is located under docs/ and is using Sphinx. You can generate the documentation by changing into this directory and running make html. If you don't have Sphinx installed in your environment you can install it with pip install sphinx.

mrmasterplan commented 1 year ago

Thanks @andialbrecht. I added some documentation. I don't know if it fits your style. It covers a somewhat different topic so it was not straightforward to copy the style of another section.

mrmasterplan commented 1 year ago

So @andialbrecht, anything you would like me to change, or are you prepared to merge and release this?

andialbrecht commented 1 year ago

Thanks a lot for this change!