mullr opened 4 years ago
If I submitted a PR along these lines, would you consider accepting it?
I hope @andygrove and @benesch would comment.
As I'm also interested in analysis, I'd love something like this implemented.
Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)
A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the expected results can be more compact than ours.
Do you think this can be implemented without adding too much boilerplate? (In the tests, in particular, they're rather verbose as it is.)
I don't think it'll be too bad. I've hacked on this a little, and the initial approach I took was to exclude span fields from equality / hash computation, as they pretty clearly count as 'metadata' and not the data itself. But I haven't gotten as far as updating the test cases yet.
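A minimal sketch of that approach (type and field names are illustrative, not the actual sqlparser-rs definitions): equality and hashing deliberately skip the span field, so existing AST comparisons in tests keep passing.

```rust
// Hypothetical Span carrying byte offsets; names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone)]
pub struct Ident {
    pub value: String,
    pub quote_style: Option<char>,
    /// Source location; treated as metadata, not data.
    pub span: Span,
}

// Equality and hashing ignore `span`, so two idents parsed from
// different positions still compare equal.
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value && self.quote_style == other.quote_style
    }
}
impl Eq for Ident {}

impl std::hash::Hash for Ident {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.value.hash(state);
        self.quote_style.hash(state);
    }
}
```

Note that manually implementing `PartialEq`/`Hash` like this (or deriving them via a crate such as `derivative`) keeps the `Eq`/`Hash` consistency contract intact, since both implementations skip the same field.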
A tangentially related issue I've been mulling over is the "AST" types we have now and how they can support dialects. For a while now I've wanted to experiment with adopting rust-analyzer's design of having a separate untyped/homogeneous syntax tree (similar to your Spans, but with an additional "type tag" and some structure), and a (generated) strongly typed API for accessing it. The first layer solves the problem you brought up, and the expected results can be more compact than ours.
Their stated design goal, "Syntax trees are lossless, or full fidelity. All comments and whitespace are preserved.", is what I actually want in the end. With Span information on both the token stream and the AST, I should be able to reverse out comment / AST node associations. But having an intermediate homogeneous tree (a CST?) would be a much better option.
I've pushed up a 'span' branch on my fork that implements a small slice of this, to serve as discussion fodder: https://github.com/mullr/sqlparser-rs/commits/span
Adding Spans to the tokenizer output was very easy. Getting them into the guts of the parser was a bit more work, but straightforward. I added it to a single AST node (Ident), and I think I arrived at a pattern that isn't too disruptive, especially to the tests.
Sorry, I probably won't have time to look at this until the weekend, but I will review this.
The tests aren't affected because of the #[derivative(PartialEq = "ignore")] on the new span member of Ident, right? Nice!
A few quick thoughts:
The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected; in particular, having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.
I now wonder what the rule is to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?
The changes introducing TokenOccurrence to the parser turn out to be more involved than I expected, in particular having to match on TokenOccurrence { token: Token::Word(ref k), .. } is unfortunate, but I don't have any ideas on how to improve it off-hand.
Yeah, it felt pretty bad writing that, but I don't have a better idea either.
I now wonder what the rule is to determine if a span must be stored for a given bit of the AST. Take pub order_by: Vec<OrderByExpr> in Query. Would it be enough for the spans to be added only to existing nodes (Query and OrderByExpr)? What about pub rows: OffsetRows in Offset? pub asc: Option<bool> in OrderByExpr?
It's a tricky question, especially because the AST here is especially abstract. I would probably put it in places where there's already a struct that clearly corresponds to a piece of syntax, but wouldn't create new pieces of structure just to staple a Span on to.
In your examples: OrderByExpr, yes. Query, yes. Query.offset: yes, but because it's an Expr, and that would get a span. The asc field of OrderByExpr: no.
A truly convincing answer to the question would probably require an intermediate representation, as discussed above.
It seems that passing the spans explicitly throughout the parser would be the biggest pain, not the tests.
Yep, that's likely where the bulk of the work is. "A pain" is a good way to describe it, as it should be fairly rote.
One thing to note: only leaf AST nodes actually require the span to be attached to them as a field. Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.
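The leaf-plus-computed idea could look roughly like this (all names here are hypothetical, sketched for discussion rather than taken from any PR):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    /// Smallest span covering both inputs.
    pub fn union(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

pub trait Spanned {
    fn span(&self) -> Span;
}

// A leaf node stores its span as a field...
pub struct Ident {
    pub value: String,
    pub span: Span,
}

impl Spanned for Ident {
    fn span(&self) -> Span {
        self.span
    }
}

// ...while an intermediate node derives its span from its children.
pub struct CompoundIdent {
    pub parts: Vec<Ident>,
}

impl Spanned for CompoundIdent {
    fn span(&self) -> Span {
        self.parts
            .iter()
            .map(|p| p.span())
            .reduce(Span::union)
            .expect("a compound ident has at least one part")
    }
}
```

This works cleanly when every byte of a node's source text is covered by some child, which is exactly the assumption questioned below.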
Intermediate nodes can have an implementation of the Spanned trait which computes their span based on their children.
I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?
I think that would only work for the rowan-like design, where concatenating the leaf nodes' text returns the original source. Coming back to the OrderByExpr example, it represents the ORDER BY <expr> ASC/DESC string in the source, but the only "leaf" node under it is Expr. So it seems like it needs a span too, and that logic applies to pretty much every node, no?
You're absolutely right, I hadn't thought of it that way.
Some way of accessing spans would also enable better error reporting in e.g. LocustDB since it makes it possible to point to specific spans within the original query string.
I wanted to propose an approach that may be verbose but could help at least in testing and in deriving a way to output SQL from the AST per #18 (which I think would require spans on tokens to implement). My goal is to read how each portion of SQL is parsed and eventually rewritten, and to choose how any comments are maintained or removed during the rewrite. I'd like to be able to visualize them in a git-diff-like view comparing the old to the new, stored within the AST somehow, with the opportunity to omit unnecessary extra comments if needed.
- retain-original-sql
- retain-prior-lines and retain-next-lines
- retain-comments-only or retain-all
- ---- instead of two -- to help distinguish what was a historical comment, or /*/* instead of /* for multi-line comments; two flags retain-flag-prefix-single='--' and retain-flag-prefix-multi='/*'
- retain-no-output=['completed issue num','commenter id char']
@alamb I saw you stepped up as a maintainer of the library and that you showed interest in getting https://github.com/sqlparser-rs/sqlparser-rs/pull/189 restarted. I'm not sure I'll have time to resurrect that PR, but just in case: is this issue still relevant to the project? And has anyone else worked on it since that PR?
Thanks @aochagavia -- The issue is definitely still relevant. It is an often requested feature
It came up recently here, in fact: https://github.com/sqlparser-rs/sqlparser-rs/pull/839#issuecomment-1996879328
I started collecting related work / tickets in the description of this ticket so we can track it more easily
I'm also interested in being able to provide better/pretty errors like with https://lib.rs/crates/ariadne or https://github.com/zkat/miette.
Adding text spans to:
pub struct Location {
    /// Line number, starting from 1
    pub line: u64,
    /// Line column, starting from 1
    pub column: u64,
    /// Offset of the span start in the source text
    pub start: u64,
    /// Offset of the span end in the source text
    pub end: u64,
}
Is probably a first step.
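Since the tokenizer already tracks line/column, the new start/end fields could be derived from them. A sketch of that conversion, assuming start/end are absolute byte offsets (the proposal above doesn't specify; the function name is made up here):

```rust
/// Convert a 1-based (line, column) pair into an absolute byte offset,
/// which is what the proposed `start`/`end` fields would store directly.
/// Byte-based: multi-byte characters would need char-aware counting.
fn offset_of(sql: &str, line: u64, column: u64) -> usize {
    let line_start: usize = sql
        .lines()
        .take(line as usize - 1)
        .map(|l| l.len() + 1) // +1 for the newline terminator
        .sum();
    line_start + column as usize - 1
}
```

Carrying the offsets on Location directly would save callers from redoing this scan every time they need a slice of the original SQL.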
We are trying to provide better error messages at least, so I forked and added:
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParserError {
    TokenizerError(String),
    ParserError(String),
    ParserErrorWithToken {
        msg: String,
        token: TokenWithLocation,
    },
    RecursionLimitExceeded,
}
Then I overrode the major parser methods, and now I can produce something like:
Error: SQL parser error: unsupported binary operator LIKE
╭─[SQL:1:1]
│
1 │ SELECT * FROM c WHERE a LIKE 1
│ ──┬─
│ ╰─── Error here
───╯
I locate the spans using the line and column info and by splitting the original SQL.
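The core of that rendering can be sketched in a few lines. This is an illustrative stand-in, not the fork's actual code (the function name and layout are made up; crates like ariadne or miette do the fancy box-drawing shown above):

```rust
/// Render a source line with a caret marker under a 1-based
/// (line, column) position, `len` characters wide.
fn point_at(sql: &str, line: usize, column: usize, len: usize) -> String {
    let src_line = sql.lines().nth(line - 1).unwrap_or("");
    let marker = format!("{}{}", " ".repeat(column - 1), "^".repeat(len.max(1)));
    format!("{src_line}\n{marker}")
}
```

Given the query above, `point_at(sql, 1, 25, 4)` underlines the LIKE keyword.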
I think that splitting the PR into two steps will make things simpler:
Hi! I am looking to pick this up, and I can see that there are at least two PRs attempting to implement this, #839 and #790, plus @mamcx's fork. Are there any opinions on which approach is most likely to make it in and which would make the most sense as a starting point?
Are there any opinions on which approach is most likely to make it in and which would make the most sense as a starting point?
Thank you @Nyrox -- we certainly need a hero to make this happen
The key in my mind is to find a way to incrementally introduce this feature over a few releases so downstream users have time to integrate / update the changes and we minimize disruption
@alamb would you be open to collaborating on how we should approach this if we take on the dev task and workload (@Nyrox and team)? We just want to ensure the changes can and will be upstreamed.
The key in my mind is to find a way to incrementally introduce this feature over a few releases so downstream users have time to integrate / update the changes and we minimize disruption
I think this is a good idea. I have implemented in my fork source locations for a good subset of AST nodes in a way which minimally breaks downstream (essentially by avoiding the WithSpan<T> approach of the other PRs and only adding fields instead). This does not yield perfect results, i.e. for a lot of constructs the span will be slightly wrong due to missing information, but this way it can be refined in later PRs.
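To make the trade-off concrete, here is a simplified sketch of the two shapes being compared (types are stand-ins, not the real sqlparser-rs definitions):

```rust
// Illustrative comparison of the two API shapes; simplified types.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

// Wrapper approach: every spanned field changes type, so a field of
// `Expr` becomes `WithSpan<Expr>`, breaking downstream pattern matches.
pub struct WithSpan<T> {
    pub node: T,
    pub span: Span,
}

// Field approach: type names stay stable. Downstream constructors can
// migrate gradually via `..Default::default()` until real spans are
// threaded through the parser.
#[derive(Debug, Default, PartialEq)]
pub struct OrderByExpr {
    pub expr: String, // stand-in for the real Expr type
    pub asc: Option<bool>,
    pub span: Span,
}
```

With the field approach, old code like `OrderByExpr { expr, asc, ..Default::default() }` keeps compiling, which is what makes the incremental rollout over several releases feasible.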
@alamb should I open a draft PR where we can discuss details?
@Nyrox and @mkarbo yes please let's move the discussion to a PR
cc @iffyio @lovasoa and @jmhain
I'd like to use this library for some refactoring and analysis tools. It would be very helpful to have source line/column information available on the AST nodes, something along the lines of Span from syn (https://docs.rs/syn/1.0.17/syn/spanned/trait.Spanned.html). I see that line/column is already tracked in the tokenizer. I'm thinking something along the lines of:
The main objection I can think of is that it adds some bulk to Tokens and to each AST node. This could be alleviated, if desired, by hiding them behind a feature flag. I looked at syn to see how they do it; the approach there is very similar. Afaict, their feature flagging doesn't actually affect the node / token representation; they just carry it all the time. Since the implementation here doesn't appear to be hyper-optimized, and since it's good enough for syn, it seems like this approach should be okay.
If I submitted a PR along these lines, would you consider accepting it?
Related work: