Add Databricks & Delta Table Operation Support

sjrusso8 commented 2 years ago

This PR is to add frequently used Databricks and Delta table syntax. Delta is the main storage format for the Databricks platform for the whole 'Data Lakehouse' paradigm. Databricks SQL has a lot of special operations to work with Delta tables which means a lot of new keywords.

Here is an example of standard operations of Databricks SQL for a created Delta table.

CREATE TABLE IF NOT EXISTS default.event 
(
    id INT, 
    name STRING, 
    description VARCHAR(30)
)
USING delta
LOCATION '/mnt/data/location'
PARTITIONED BY (id)
COMMENT 'this is a comment'
TBLPROPERTIES (
    'foo'='bar',
    delta.autoOptimize.optimizeWrite = true, 
    delta.autoOptimize.autoCompact = true
);

OPTIMIZE event 
WHERE date >= current_timestamp() - INTERVAL 1 day 
ZORDER BY (id);

VACUUM event;

CREATE BLOOMFILTER INDEX ON TABLE event 
FOR COLUMNS(description OPTIONS (fpp=0.1, numItems=50000000));

CREATE TABLE default.event_clone SHALLOW CLONE default.event;

DESCRIBE HISTORY event;

DESCRIBE TABLE EXTENDED event;

SHOW DETAIL event;

MSCK REPAIR TABLE event SYNC METADATA;

REFRESH TABLE event;

Then operating on those statements should parse out additional keywords like below.

statements = sqlparse.parse(sql)

for statement in statements:
    result = [v.value for v in sqlparse.sql.IdentifierList(statement.tokens).get_identifiers() if v.is_keyword]

    print(result)

>>> ['CREATE', 'TABLE', 'IF', 'NOT', 'EXISTS', 'USING', 'LOCATION', 'PARTITIONED BY', 'COMMENT', 'TBLPROPERTIES']
>>> ['OPTIMIZE', 'ZORDER BY']
>>> ['VACUUM']
>>> ['CREATE', 'BLOOMFILTER INDEX', 'ON', 'TABLE', 'FOR']
>>> ['CREATE', 'TABLE', 'SHALLOW CLONE']
>>> ['DESCRIBE', 'HISTORY']
>>> ['DESCRIBE', 'TABLE', 'EXTENDED']
>>> ['SHOW', 'DETAIL']
>>> ['MSCK REPAIR', 'TABLE', 'SYNC', 'METADATA']
>>> ['REFRESH', 'TABLE']

Please refuse this PR if this is too much of a change. There are a lot of additional niche Databricks SQL operations that are not covered by this PR, but I can add them if needed.

mrmasterplan commented 1 year ago

Oh! I didn't see your PR when I did mine: #690 Should I pull in your changes or do you want to pull in mine? We should definitely only have one PR about this.

mrmasterplan commented 1 year ago

I think this project does not have a high priority for the maintainer. My last PR #643 was only accepted after 11 months. Hey @andialbrecht, would you consider to enable other maintainers into this project? Alternatively, @sjrusso8, do you think maybe we should publish a fork of this package? I am already a member of an open source community atc-net where we maintain some helper classes that we use across several of our projects. Maybe we should open a fork repo there?

sjrusso8 commented 1 year ago

@mrmasterplan I updated this PR with some of your changes and added unit tests. I'll follow your lead on where you want to fork sqlparse code for atc-net :)

I did adjust a few of your proposed changes from #690.

We should keep spark data sources (TXT, DELTA, PARQUET, JSON, etc.) as identifiers and not keywords. It can create some weird side effects if you have a statement like SELECT * FROM csv. The table ref for csv in this case would be flagged as a keyword, but all other table refs are identifiers.
Changed SORTED BY, PARTITIONED BY, and CLUSTERED BY to match the both words as a single keyword cause SparkSQL expects those to be grouped together
Adjusted the WITH DBPROPERTIES statement because the parser wanted to treat WITH as a CTE classifier

mrmasterplan commented 1 year ago

Nice work @sjrusso8. I agree with all of your changes. I have marked my own PR as abandoned.

mrmasterplan commented 1 year ago

@andialbrecht I just want to clarify my comment above. I meant no disrespect. I have huge respect for the work you do as the principal author and maintainer of this decently popular open source library. I realize that you are doing a huge service to the community with no compensation by addressing our issues here. My comments were merely meant to address the fact that this PR contributes something that enables a feature on my downstream project and I am interested in a speedy solution.

andialbrecht commented 1 year ago

@mrmasterplan no worries! I know that it seems from time to time that this project isn't maintained well. But in fact I'm actively monitoring all incoming issues and topics. To keep my personal schedule somewhat clean, I used to work "in blocks" on this project. The drawback is that some valuable work of others in pull requests, gets stuck for a while.

andialbrecht commented 1 year ago

@sjrusso8 pull request #693 contains very valuable work by @mrmasterplan that makes it much easier - and cleaner - to add additional keywords to sqlparse. I'd suggest to wait until this pr has landed in master and then rebase your work on the new mechanism for adding keywords as an extra dialect that's not included in the default parser.

sjrusso8 commented 1 year ago

@andialbrecht for sure! @mrmasterplan latest PR makes adding niche dialects WAY easier

andialbrecht / sqlparse

Add Databricks & Delta Table Operation Support #687