apache / datafusion-sqlparser-rs

Extensible SQL Lexer and Parser for Rust
Apache License 2.0

Add Span to Tokens, AST nodes #161

Open mullr opened 4 years ago

mullr commented 4 years ago

I'd like to use this library for some refactoring and analysis tools. It would be very helpful to have source line/column information available on the AST nodes, something along the lines of Span from syn. (https://docs.rs/syn/1.0.17/syn/spanned/trait.Spanned.html)

I see that line/column is already tracked in the tokenizer. I'm thinking something along the lines of:

struct LineColumn {
    pub line: usize,
    pub column: usize,
}

struct Span {
    pub start: LineColumn,
    pub end: LineColumn,
}

struct TokenOccurrence {
    pub token: Token,
    pub span: Span,
}

struct Query {
    pub span: Span,
    pub ctes: Vec<Cte>,
    ...
}

The main objection I can think of is that it adds some bulk to Tokens, and to each AST node. This could be alleviated, if desired, by hiding the span fields behind a feature flag. I looked at syn to see how they do it; the approach there is very similar. AFAICT, their feature flagging doesn't actually affect the node or token representation; they just carry the spans all the time.

Since the implementation here doesn't appear to be hyper-optimized, and since it's good enough for syn, it seems like this approach should be okay.

If I submitted a PR along these lines, would you consider accepting it?

Related work:

nickolay commented 4 years ago

If I submitted a PR along these lines, would you consider accepting it?

I hope @andygrove and @benesch would comment.

As I'm also interested in analysis, I'd love something like this implemented.

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the resulting representation can be more compact than ours.

mullr commented 4 years ago

Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)

I don't think it'll be too bad. I've hacked on this a little, and the initial approach I took was to exclude span fields from equality / hash computation, as they pretty clearly count as 'metadata' and not the data itself. But I haven't gotten as far as updating the test cases yet.
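A minimal sketch of that idea, with hand-rolled impls rather than the derivative macro used in the actual branch (the `Span` shape here is illustrative, not the crate's API):

```rust
#[derive(Debug, Clone, Copy, Default)]
pub struct Span {
    pub start: (usize, usize), // (line, column), illustrative only
    pub end: (usize, usize),
}

#[derive(Debug, Clone)]
pub struct Ident {
    pub value: String,
    pub span: Span, // metadata: deliberately excluded from Eq/Hash below
}

// Compare only the value, so tests that construct `Ident`s without
// meaningful spans still compare equal to parsed ones.
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}
impl Eq for Ident {}

// Hash must stay consistent with Eq, so it also ignores the span.
impl std::hash::Hash for Ident {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.value.hash(state);
    }
}
```

With this, two `Ident`s with the same value but different spans are equal and hash identically, which is what keeps the existing test suite unchanged.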

A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the resulting representation can be more compact than ours.

Their design goal of "Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved." is what I actually want in the end. With Span information, on both the token stream and the AST, I should be able to reverse out comment / AST node associations. But having an intermediate homogeneous tree (CST?) would be a much better option.

mullr commented 4 years ago

I've pushed up a 'span' branch on my fork that implements a small slice of this, to serve as discussion fodder: https://github.com/mullr/sqlparser-rs/commits/span

Adding Spans to the tokenizer output was very easy. Getting them into the guts of the parser was a bit more work, but straightforward. I added it to a single AST node (Ident), and I think I arrived at a pattern that isn't too disruptive, especially to the tests.

andygrove commented 4 years ago

Sorry, I probably won't have time to look at this until the weekend, but I will review this.

nickolay commented 4 years ago

The tests aren't affected, because of the #[derivative(PartialEq = "ignore")] on the new span member of Ident, right? Nice!

A few quick thoughts:

mullr commented 4 years ago

The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.

Yeah, it felt pretty bad writing that, but I don't have a better idea either.

I now wonder what is the rule to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?

It's a tricky question, especially because the AST here is especially abstract. I would probably put it in places where there's already a struct that clearly corresponds to a piece of syntax, but wouldn't create new pieces of structure just to staple a Span on to.

In your examples: OrderByExpr, yes. Query, yes. Query.offset: yes, but because it's an Expr, and that would get a span. asc field of OrderByExpr: No.

A truly convincing answer to the question would probably require an intermediate representation, as discussed above.

It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.

Yep, that's likely where the bulk of the work is. "A pain" is a good way to describe it, as it should be fairly rote.

mullr commented 4 years ago

One thing to note: only leaf AST nodes actually require the span to be attached to them as a field. Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.
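A hypothetical sketch of that pattern (the trait, node shapes, and byte-offset `Span` are illustrative, not the crate's API): leaves store a span as a field, and a parent derives its span as the union of its children's.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Span {
    pub start: usize, // byte offset of the first character
    pub end: usize,   // byte offset one past the last character
}

pub trait Spanned {
    fn span(&self) -> Span;
}

// Leaf node: carries its span explicitly.
pub struct Ident {
    pub value: String,
    pub span: Span,
}

impl Spanned for Ident {
    fn span(&self) -> Span {
        self.span
    }
}

// Intermediate node: stores no span; computes it from its children,
// e.g. `db.tbl` spans from the start of `db` to the end of `tbl`.
pub struct CompoundIdent {
    pub parts: Vec<Ident>,
}

impl Spanned for CompoundIdent {
    fn span(&self) -> Span {
        Span {
            start: self.parts.first().map(|i| i.span.start).unwrap_or(0),
            end: self.parts.last().map(|i| i.span.end).unwrap_or(0),
        }
    }
}
```

As the follow-up comment points out, this only works where the children actually cover the construct's full extent in the source.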

nickolay commented 4 years ago

Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

mullr commented 4 years ago

I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?

You're absolutely right, I hadn't thought of it that way.

cswinter commented 4 years ago

Some way of accessing spans would also enable better error reporting in e.g. LocustDB since it makes it possible to point to specific spans within the original query string.

datocrats-org commented 2 years ago

I wanted to propose an approach that may be verbose but could help, at least in testing and in deriving a way to output SQL from the AST per #18 (which I think would require spans on tokens to implement). My goal is to see how each portion of SQL is parsed and eventually rewritten, and to choose how any comments are maintained or removed during the rewrite. I'd like to be able to visualize them in a git-diff-like view comparing the old to the new, stored within the AST somehow, giving me the opportunity to omit unnecessary extra comments if needed.

aochagavia commented 7 months ago

@alamb I saw you stepped up as a maintainer of the library and that you showed interest in getting https://github.com/sqlparser-rs/sqlparser-rs/pull/189 restarted. I'm not sure I'll have time to resurrect that PR, but just in case: is this issue still relevant to the project? And has anyone else worked on it since that PR?

alamb commented 7 months ago

Thanks @aochagavia -- The issue is definitely still relevant. It is an often-requested feature.

It came up recently here, in fact: https://github.com/sqlparser-rs/sqlparser-rs/pull/839#issuecomment-1996879328

alamb commented 7 months ago

I started collecting related work / tickets in the description of this ticket so we can track it more easily

mamcx commented 1 month ago

I'm also interested in being able to provide better/pretty errors like with https://lib.rs/crates/ariadne or https://github.com/zkat/miette.

Adding text spans to:

pub struct Location {
    /// Line number, starting from 1
    pub line: u64,
    /// Line column, starting from 1
    pub column: u64,

    pub start: u64,
    pub end: u64,
}

Is probably a first step.

mamcx commented 1 month ago

We are trying to provide better error messages at least, so I forked and added:

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParserError {
    TokenizerError(String),
    ParserError(String),
    ParserErrorWithToken {
        msg: String,
        token: TokenWithLocation,
    },
    RecursionLimitExceeded,
}

Then I override the major parser methods and now I could do something like:

Error: SQL parser error: unsupported binary operator LIKE
   ╭─[SQL:1:1]
   │
 1 │ SELECT * FROM c WHERE a LIKE 1
   │                  ──┬─  
   │                    ╰─── Error here
───╯

I locate the spans using the line and column info, splitting the original SQL into lines.
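That line/column lookup can be sketched roughly like this (a simplification, not the fork's actual code; it assumes the tokenizer's 1-based line and column numbers and single-byte characters):

```rust
/// Convert a 1-based (line, column) position into a byte offset into
/// `sql` by walking its lines. Returns None if the position is out of
/// range. Columns are treated as byte offsets, which is only exact
/// for ASCII input.
fn offset_of(sql: &str, line: u64, column: u64) -> Option<usize> {
    let mut offset = 0usize;
    for (i, l) in sql.lines().enumerate() {
        if (i as u64) + 1 == line {
            // Convert the 1-based column to a 0-based offset.
            let col = (column as usize).checked_sub(1)?;
            return (col <= l.len()).then(|| offset + col);
        }
        offset += l.len() + 1; // +1 for the '\n' separator
    }
    None
}
```

Given such an offset for the start and end of the offending token, a reporting crate like ariadne or miette can underline the exact range in the query.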

I think splitting the PR into 2 steps will make things simpler:

Nyrox commented 1 month ago

Hi! I am looking to pick this up, and I can see that there are at least 2 PRs attempting to implement this, #839 and #790, plus @mamcx's fork. Are there any opinions on which approach is most likely to make it in and which would make the most sense as a starting point?

alamb commented 1 month ago

Are there any opinions on which approach is most likely to make it in and which would make the most sense as a starting point?

Thank you @Nyrox -- we certainly need a hero to make this happen

The key in my mind is to find a way to incrementally introduce this feature over a few releases, so downstream users have time to integrate the changes and we minimize disruption.

mkarbo commented 1 month ago

@alamb would you be open to collaborating on how we should approach this if we take on the dev task and workload (@Nyrox and team)? We just want to ensure the changes can and will be upstreamed.

Nyrox commented 1 month ago

The key in my mind is to find a way to incrementally introduce this feature over a few releases so downstream users have time to integrate / update the changes and we minimize disruption

I think this is a good idea. In my fork I have implemented source locations for a good subset of AST nodes in a way which minimally breaks downstream (essentially by avoiding the WithSpan<T> approach of the other PRs and only adding fields instead). This does not yield perfect results; for a lot of constructs the span will be slightly wrong due to missing information, but this way it can be refined in later PRs.
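The difference between the two shapes might look like this (an illustrative sketch, not either PR's actual code):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

// Wrapper approach (the WithSpan<T> style): every use site changes
// from `T` to `WithSpan<T>`, which breaks downstream pattern matches
// and constructors in one release.
pub struct WithSpan<T> {
    pub inner: T,
    pub span: Span,
}

// Field approach: add a `span` field to the node itself. With a
// Default impl, downstream code that builds nodes via
// `..Default::default()` keeps compiling, and the span can start out
// approximate and be refined later.
#[derive(Default)]
pub struct OrderByExpr {
    pub asc: Option<bool>,
    pub span: Span,
}
```

The trade-off: the wrapper makes spans impossible to forget but breaks every match site at once, while added fields are incremental but easy to leave unpopulated.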

@alamb should I open a draft PR where we can discuss details?

alamb commented 1 month ago

@Nyrox and @mkarbo yes please let's move the discussion to a PR

cc @iffyio @lovasoa and @jmhain