sjrusso8 closed this pull request 1 year ago
Oh! I didn't see your PR when I did mine: #690. Should I pull in your changes, or do you want to pull in mine? We should definitely only have one PR about this.
I think this project is not a high priority for the maintainer. My last PR #643 was only accepted after 11 months. Hey @andialbrecht, would you consider adding other maintainers to this project? Alternatively, @sjrusso8, do you think we should publish a fork of this package? I am already a member of an open source community, atc-net, where we maintain some helper classes that we use across several of our projects. Maybe we should open a fork repo there?
@mrmasterplan I updated this PR with some of your changes and added unit tests. I'll follow your lead on where you want to fork sqlparse code for atc-net :)
I did adjust a few of your proposed changes from #690.
We should keep Spark data sources (TXT, DELTA, PARQUET, JSON, etc.) as identifiers and not keywords. It can create some weird side effects if you have a statement like SELECT * FROM csv. The table ref for csv in this case would be flagged as a keyword, but all other table refs are identifiers (see the sketch after this list).
Changed SORTED BY, PARTITIONED BY, and CLUSTERED BY to match both words as a single keyword because SparkSQL expects those to be grouped together.
Adjusted the WITH DBPROPERTIES statement because the parser wanted to treat WITH as a CTE classifier.
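A minimal sketch of what these changes mean for parsed output, assuming the keyword definitions from this PR are in place; the table and column names are invented for the example, and the exact token types depend on the final keyword lists:

```python
import sqlparse
from sqlparse import tokens as T

# With TXT/DELTA/PARQUET/CSV/etc. left as identifiers, a table that happens
# to be named "csv" stays an identifier instead of being flagged as a keyword.
select = sqlparse.parse("SELECT * FROM csv")[0]
print([(tok.ttype, tok.value) for tok in select.flatten() if not tok.is_whitespace])

# With PARTITIONED BY matched as a single two-word keyword, the pair comes
# back as one Keyword token rather than two unrelated tokens.
create = sqlparse.parse(
    "CREATE TABLE events (id INT, ds STRING) USING DELTA PARTITIONED BY (ds)"
)[0]
print([tok.value for tok in create.flatten() if tok.ttype in T.Keyword])
```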
Nice work @sjrusso8. I agree with all of your changes. I have marked my own PR as abandoned.
@andialbrecht I just want to clarify my comment above. I meant no disrespect. I have huge respect for the work you do as the principal author and maintainer of this decently popular open source library. I realize that you are doing a huge service to the community, with no compensation, by addressing our issues here. My comment was only meant to convey that this PR enables a feature in a downstream project of mine, so I am interested in a speedy resolution.
@mrmasterplan no worries! I know that it seems from time to time that this project isn't maintained well. But in fact I'm actively monitoring all incoming issues and topics. To keep my personal schedule somewhat clean, I tend to work "in blocks" on this project. The drawback is that some valuable work by others in pull requests gets stuck for a while.
@sjrusso8 pull request #693 contains very valuable work by @mrmasterplan that makes it much easier, and cleaner, to add additional keywords to sqlparse. I'd suggest waiting until that PR has landed in master and then rebasing your work on the new mechanism for adding keywords as an extra dialect that is not included in the default parser.
@andialbrecht for sure! @mrmasterplan's latest PR makes adding niche dialects WAY easier.
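As a rough sketch of what that could look like for this PR, assuming the mechanism from #693 ends up exposing a configurable lexer singleton with something like set_SQL_REGEX and add_keywords (the exact API, and the sample keywords OPTIMIZE, VACUUM, and ZORDER BY, are assumptions here):

```python
import sqlparse
from sqlparse import keywords
from sqlparse.lexer import Lexer

# Assumed API from #693: fetch the lexer singleton and extend it in caller
# code, without touching the configuration shipped with sqlparse itself.
lexer = Lexer.get_default_instance()

# Multi-word keywords need a regex entry; prepend one to the default table.
lexer.set_SQL_REGEX(
    [(r"ZORDER\s+BY\b", sqlparse.tokens.Keyword)] + list(keywords.SQL_REGEX)
)

# Single-word additions can go into an extra keyword dictionary.
lexer.add_keywords({
    "OPTIMIZE": sqlparse.tokens.Keyword,
    "VACUUM": sqlparse.tokens.Keyword,
})

print(sqlparse.parse("OPTIMIZE events ZORDER BY (id)")[0].tokens)
```

Because the extra dialect lives in caller code, the default sqlparse behaviour stays untouched for everyone else.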
This PR adds frequently used Databricks and Delta table syntax. Delta is the main storage format on the Databricks platform for the whole 'Data Lakehouse' paradigm. Databricks SQL has a lot of special operations for working with Delta tables, which means a lot of new keywords.
Here is an example of standard Databricks SQL operations on a created Delta table.
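As a stand-in for that example (the table name events, its columns, and the table property below are invented for illustration), the statements the new keywords target look roughly like this:

```python
import sqlparse

# Representative Databricks SQL / Delta statements; names are illustrative.
DATABRICKS_STATEMENTS = [
    "CREATE TABLE events (id INT, event_date DATE) USING DELTA "
    "PARTITIONED BY (event_date)",
    "OPTIMIZE events ZORDER BY (id)",
    "VACUUM events RETAIN 168 HOURS",
    "DESCRIBE HISTORY events",
    "ALTER TABLE events SET TBLPROPERTIES ('delta.appendOnly' = 'true')",
]

for statement in DATABRICKS_STATEMENTS:
    # reindent and keyword_case are standard sqlparse.format options.
    print(sqlparse.format(statement, reindent=True, keyword_case="upper"))
```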
Parsing those statements should then pick out the additional keywords, as shown below.
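A hedged sketch of that keyword extraction; which tokens actually land in the list depends on the keyword definitions added by this PR:

```python
import sqlparse
from sqlparse import tokens as T

# Two of the illustrative statements from above.
for sql in ("OPTIMIZE events ZORDER BY (id)", "VACUUM events RETAIN 168 HOURS"):
    stmt = sqlparse.parse(sql)[0]
    # Collect every token the lexer classified as a keyword; with the
    # Databricks additions in place, keywords such as OPTIMIZE, ZORDER BY,
    # and VACUUM would be expected to show up here.
    found = [tok.value for tok in stmt.flatten() if tok.ttype in T.Keyword]
    print(sql, "->", found)
```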
Please feel free to reject this PR if this is too much of a change. There are a lot of additional niche Databricks SQL operations that are not covered by this PR, but I can add them if needed.