apache / datafusion-sqlparser-rs

Extensible SQL Lexer and Parser for Rust
Apache License 2.0
2.75k stars 527 forks source link

Thoughts on SQL Extensions #104

Closed ChristopherMacGown closed 4 years ago

ChristopherMacGown commented 5 years ago

I've been playing around with a trait based solution to allowing dialects to carry their own extension definitions. Would something along these lines be accepted?

Obviously the example is a toy, and there would need to be some additional functions defined on the trait to be able to extract the AST from the boxed Trait, but I thought it was better to get feedback before rushing off and implementing.

nickolay commented 5 years ago

Welcome! We don't have specific proposals at hand on tackling the extensibility/dialects problem. I know @andygrove intended to work on this, but he's been quite busy outside the project recently. Glad to see you thinking about this!

I thought I'd share my thoughts here, in the hope it will help clarify the proposal. I hope that @benesch will comment too, but I guess we'll need Andy's sign-off before settling on a decision.

I'd like to learn a bit about your perspective. For example, my interest is in analysing complex sets of queries written in various existing dialects. For that reason I'm interested in a dialect-agnostic parser, where dialects provide a way to pick between conflicting parsing rules. But I believe that in order to be maintainable and usable for other purposes (like for validation or in a query engine) the project needs to grow better support for dialects.

As for the prototype you shared, I can't yet say I understand the idea very well or how it'd apply to existing dialect-specific code, for example:

ChristopherMacGown commented 5 years ago

The company I work for analyzes complex queries all written in the same dialect and it was pretty trivial to add VACUUM to the parser the same way that other extensions have. Long-term, though, we expect to support other dialects as well, and trying to keep up with all of the features seems non-trivial and expensive especially if the dialects and their dialect specific SQL aren't accepted by upstream. Also, since I'm the only one at the company right now who knows Rust, anything that I can do to simplify adding dialect specific SQL would go a long way toward our being able to adopt this package for our needs.

Going back to the example…

is the scope of the solution limited to implementing custom keywords outside of the mainline crate, or are you thinking of migrating (all? some?) existing dialect-specific code to extensions?

If I were to implement a PR implementing an extension trait and the modifications to be able to parse them, I think it would be necessary to implement the existing dialect specific sql in the new way. First because it'll make testing the ergonomics of the implementation a lot easier, but also because it would be really confusing for a new developer if there were two ways that dialect specific stuff was implemented. After the dialect-specific code was moved into the dialects themselves, I guess the dialects could also be moved out of mainline eventually.

it seems that each "extension" is keyed off a keyword. Where in the parser would the extension's parsing method be invoked and how the parse result would be stored in the AST? If the scope of extensions was limited to separate statements like VACUUM, I'd guess Parser::parse_statement, but I can't yet see what mechanics would support both the VACUUM and the STRAIGHT_JOIN extensions.

I think I may have over-simplified there, assuming that I could get away with not defining the parsing mechanism on Parser. I think you're right, that the keyword based dialect specific sql would be parsed out of Parser::parse_statement, although there would need to be a way to pass in an Iterator or a HashMap of (KEYWORD, Extension) tuples. You're also right that as written, this proposal won't support STRAIGHT JOIN. When I went down this path, I wasn't even looking at the match {...} in Parser::parse_statement, but the keywords in ::dialect. The goal was that the Dialect extensions would carry their own reserved words. I need to think more about how a trait implementation could implement something like STRAIGHT JOIN, but if you have any ideas I'm game to hear them!

😄

[edit] I think we could have another Trait DialectParser that provides parse_* that match the function signatures on Parser, and in the case of dialect specific join conditions an implementor would have to add the Extension and also implement DialectParser::parse_joins.

nickolay commented 5 years ago

Thanks, I think our goals are quite aligned!

if you have any ideas I'm game to hear them!

I wanted to find and re-read the previous discussions on this, in which a number of good points have already been made, before sharing more ideas of my own. It will take a while, because reviewing PRs is higher priority for me.

Trait DialectParser

Yep, that's an option. The question then is how to integrate the extensions in the AST, as Parser::parse_joins returns Vec<Join>, which isn't extensible. Should it return a Box from all the parser methods - essentially changing the AST to be untyped?

Overriding parser's methods also involves copying much of the shared logic - in the joins example you'd copy a page of code just to add one additional join type. Yet, unless we freeze the DialectParser trait, the dialects will not really be independent from the maintenance perspective... We could introduce additional extension points in such cases, but it might be easier to just have some dialect-specific parsing in the mainline parser, enabled conditionally.

ChristopherMacGown commented 5 years ago

I think in the case of Parser::parse_joins, Join would have to become an enum in order to provide the typed extensibility. Then the dialect specific code could return Option<Join> and if it's None, Parser::parse_joins could continue with its normal logic without requiring too much copying and pasting. That's potentially a really big API change, but maybe still doable because the package is sub-1.0?

andygrove commented 5 years ago

Hi @ChristopherMacGown Thanks for suggesting an approach to support specific dialects. It's good to see a discussion happening on this.

It certainly is challenging making this extensible without the need to duplicate code.

To add to the use case discussion I would imagine many users just want a parser for a specific dialect they are working with (HiveQL in my case).