Closed max-sixty closed 3 months ago
I have some ideas on how to handle this, but as I understand, @aljazerzen is in the middle of a refactor. I've generally found it bad practice to have more than one refactor being worked on at the same time, as the second to be merged in will have a hell of a bad time with it. Ironically, if the organization @max-sixty proposes is in place, this wouldn't be much of an issue.
@aljazerzen what is the state of your refactor, and how far does it encompass? Can you give me an idea of what you are/are not touching so I can respect the work you've done so far?
Another related question:
PRQL
(parser) │ ▲
prql_to_pl │ │ pl_to_prql
│ │
▼ │ json::from_pl
────────►
PL AST PL JSON
◄────────
│ json::to_pl
│
(resolver) │
pl_to_rq │
│
│
▼ json::from_rq
────────►
RQ AST RQ JSON
◄────────
│ json::to_rq
│
rq_to_sql │ (SQL compiler)
▼
SQL
This diagram should be accurate, but isn't quite up to date, since PL and AST aren't exactly the same, and prql_to_pl
technically outputs in prqlc-ast
format. Part of the confusion is the different terminology. Should the "parser" be returning PL instead of (parser) AST? If so, should the PL objects be moved to the parser as well?
The end result might be that everything called by prql_to_pl
(and possibly pl_to_prql
too) exists within the "parser"
Right now it looks like the resolver is actually converting from AST to PL, then to RQ.
I'm in favor of this change - a separate crate is unnecessary and tedious to use.
It will be painful to merge this into my branch, but let's not take that as an excuse to put off improvements on the master branch. In the main prqlc crate I've added use prqlc_ast as ast
into the main lib.rs
, so everywhere else, I (am trying to) have ast imported as use crate::ast
, which would make merging this much easier.
One thing I would ask, is to do this in separate commits (or PRs, just that the master branch contains separate commits in the end).
(PS: my branch is now closer to a complete rewrite of semantic
module than a refactor. To merge that, we will have to make changes to the language itself)
The diagram in the lib should be updated, I agree, but we need to rethink how we see the compiler stages.
Also, a bit of terminology decision: is PL an Abstract Syntax Tree? is RQ an Abstract Syntax Tree?
When the prql-ast was split into separate crate, it assumed that PL and RQ are IRs, not an AST. I think that the prqlc-ast, PL and RQ should all be named ASTs, because it is nice to have a name that covers all of these representations.
If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.
Something that would probably help with code readability would be to use module-qualified names of AST nodes:
- use crate::ir::pl::Expr;
-
- fn print(e: Expr) -> ...
+ use crate::ir::pl;
+
+ fn print(e: pl::Expr) -> ...
This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs pl::InterpolationItem
, you probably don't need to add a new import, because pl
is probably already imported.
These are stages of the compiler as I see them now:
stage | sub-stage | AST | comment |
---|---|---|---|
parse | lexer | string -> tokens | |
parse | parser | tokens -> prqlc-ast PT | |
semantic | ast_expand | PL | bad name, this could be called "desugar" |
semantic | resolver | PL | this does a lot of stuff, my rewrites splits it up |
semantic | flatten | PL | |
semantic | lowering | PL -> RQ | |
sql | preprocess | RQ | |
sql | srq compiler | RQ -> SRQ | |
sql | postprocess | SRQ | |
sql | SQL compiler | SRQ -> sqlparser::ast | |
sql | codegen | sqlparser::ast -> string |
Something that would probably help with code readability would be to use module-qualified names of AST nodes:
- use crate::ir::pl::Expr; - - fn print(e: Expr) -> ... + use crate::ir::pl; + + fn print(e: pl::Expr) -> ...
This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs
pl::InterpolationItem
, you probably don't need to add a new import, becausepl
is probably already imported.
This would be very nice... as a relatively new contributor, I spent so much time tracking down "is it this Expr or that Expr?" The duplication of names made searching the codebase much harder.
You can even see this in the GitHub interface - just find any instance of the word "Expr" and click on it, it shows six different places where you might need to check for its implementation!
What is the tradeoff on struct naming conventions for:
ir::pl::Expr
ir::rq::Expr
vs
ir::pl::PlExpr
ir::rq::RqExpr
?
Do we see one as much highly preferred than the other?
I dislike the second option a lot because:
I know naming conventions are practically the definition of bikeshedding... but again, as a new contributor -- I favor things like ir::pl::PlExpr
, simply because it provides better signposting when you find yourself in the middle of an unfamiliar codebase and need to get your bearings. Frequently I found that it isn't obvious at first glance what compiler phase you're in if you've been given a random line of code that's throwing a compiler panic. To be honest, I still feel that way sometimes when I'm wandering around the codebase. It would make it much easier if it was a bit more explicit that "this code is working on PL".
Unfortunately it appears that GitHub's symbol explorer isn't clever enough to look at the use
statements to figure out which Expr
is in scope for a given module definition. I mostly mention that because as someone who was considering contributing, it was important for me to try to get my bearings in the codebase before I went to the trouble of firing up a more sophisticated code editor. It might seem like a weird reason, but simple tools like the GitHub web interface or grep
would be a lot easier to use if we went with PlExpr
.
And what do you think about that using pl::Expr
instead of plain Expr
?
Unfortunately it appears that GitHub's symbol explorer isn't clever enough to look at the
use
statements to figure out whichExpr
is in scope for a given module definition.
This feels like a good moment to mention that searching by exact symbol is a feature of a full-fledged "smart" IDE like RustRover:
And what do you think about that using
pl::Expr
instead of plainExpr
?
Definitely a good idea! I don't know if it would help the GitHub interface be smarter (I'm guessing not) but it would still be helpful to newcomers.
And what do you think about that using
pl::Expr
instead of plainExpr
?
Definitely — I did a very minor version of this, as an example, in https://github.com/PRQL/prql/pull/4398
@kgutwin Yeah, github interface is not on par with rust-analyzer of rust-rover.
Ok, deal. That can be a general rule from now on: each reference to any AST type should be module-qualified.
@max-sixty Are we set on the name PT for the prqlc-ast/early AST/parser AST?
If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.
If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.
Yes!
Though a clever-but-maybe-confusing approach would be to keep combine characters from PRQL for the names — so we could use "PR" for "Parser Representation". PR->PL->RQ->...
(Tokens
could be "LR" for "Lexer Representation"!)
:D The last one would be QL?
These are all available names. We can have up to 6 IRs:
PRQL
-- PR
- - PQ
- - PL
-- RQ
- - RL
-- QL
Depends if we allow permutations or combinations!
We get 12 of we allow permutations such as LR
!
I actually think would be fun to do — iff we can backronym them. Lexer Representation / Parser Language / Qualified Representation / etc!
OK if this works it would be fun.
~(Maybe it doesn't work, feel free to pull the rip cord...)~
~How about:~
Edited based on discussion below...
stage | sub-stage | AST | comment |
---|---|---|---|
parse | lexer | string -> ~tokens~ LR — Lexer Representation | |
parse | parser | tokens -> PR — Parser Representation | |
semantic | ast_expand | PL — Pipelined Language | |
semantic | resolver | PL | |
semantic | flatten | PL | |
semantic | lowering | PL -> RQ — Resolved Query | |
sql | preprocess | RQ | |
sql | srq compiler | RQ -> ~SRQ~ PQ — Partitioned Query | |
sql | postprocess | SRQ | |
sql | SQL compiler | SRQ -> sqlparser::ast | |
sql | codegen | sqlparser::ast -> string |
I like the LR and PR, names confirmed.
PL stands for "Pipelined Language" since this representation does not make much assumptions about relations. RQ stands for "Relational Query", but "Resolved Query" makes sense too.
I'm not so sure about LQ - arguably RQ is more linear than SRQ. The main point of SRQ is that is has the same query structure as the SQL output: pipelines have been split, subqueries were fixed to some location and set operators have been constructed. Maybe PQ, for "Partitioned Query"?
Maybe PQ, for "Partitioned Query"?
Super!
Have edited the comment above
Ok, this is sorted then, we can start refactoring.
I wouldn't mark this as a good first issue, since it involves moving a lot of files and types.
I wouldn't mark this as a good first issue, since it involves moving a lot of files and types.
OK
I think we can / should do it in stages.
@m-span if this is something you're interested in, please feel free to start. As above, let's start with the items that we can move directly into prqlc-parser
since those have the least ambiguity. No need to wait for a review if you're confident; I am generally quite available for reviews if that's helpful — though much easier for lots of small changes than one huge PR :)
From this discussion, we have the following new terminology:
stage | sub-stage | AST |
---|---|---|
parse | lexer | string -> LR — Lexer Representation |
parse | parser | tokens -> PR — Parser Representation |
semantic | ast_expand | PR -> PL — Pipelined Language |
semantic | resolver | PL |
semantic | flatten | PL |
semantic | lowering | PL -> RQ — Resolved Query |
sql | preprocess | RQ |
sql | srq-compiler | RQ -> PQ — Partitioned Query |
sql | postprocess | PQ |
sql | sql-compiler | PQ -> sqlparser::ast |
sql | codegen | sqlparser::ast -> string |
Goals:
prqlc-ast
to prqlc-parser
. prqlc-ast
is to be deprecated(I've edited your table, sql stage not entirely correct)
And with #4634 about to merge, the parser has been refactored 🎉 . Next, it seems to me, is to:
Expr
-> pl::Expr
). The only exception is for code within that AST module itself.module exporting strategy
What do you mean here? What to export (and re-export) from crates? Or what to make public in modules?
What do you mean here? What to export (and re-export) from crates? Or what to make public in modules?
Some of both. At the moment it's fairly scattered — for example https://github.com/PRQL/prql/pull/4634#discussion_r1646810136
I see.
I think I should explain my strategy at this point, but that also means I have to try to formalize thinking.
For the AST modules, I've added this exporting because there will never be name conflicts and we don't want to be writing pl::expr::Expr
, but just pl::Expr
, which is clear enough.
For compiler modules, I try design a narrow complexity-bottleneck, which in practice means "export as little as possible, ideally a single function".
For other modules, do as little of re-exporting as possible, because that effectively "moves" a declaration so when referring to it looks like it at a different location than it really is.
I hope this is all. I reserve the right to change this for When we find more weird exports I wrote :D
Great thread and direction you are moving in with separating out the different compiler stages! Looking forward to getting involved in that myself some time and it looks like that will be much simpler in future.
The following really resonated with me:
Something that would probably help with code readability would be to use module-qualified names of AST nodes:
+ use crate::ir::pl; + + fn print(e: pl::Expr) -> ...
This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs pl::InterpolationItem, you probably don't need to add a new import, because pl is probably already imported.
I would like to make an appeal that we should follow the same principle for PRQL itself when it comes to the module system and imports. I was also reminded of this by a recent comment on HackerNews:
For anyone designing a programming language, enforce namespace to includes/imports! and if possible, don't allow top level side effects.
let foo = include "lib/foo.hurl" foo.init()
it's much easier to reason about then, for example:
include "lib/foo.hurl" // side effects baz(buz) // function and variable that I have no idea if they are in the standard library, or included *somewhere*
That way it's much easier
The lack of this is something that really turned me of Nim which has a lot to like about it otherwise. I think discoverability is a really key aspect of a language. It also aids autocompletion. @aljazerzen and @m-span 's comments above underscore that it also makes it easier when working with the code, particularly when not working in an IDE which I believe is of particular importance for PRQL because many (most?) users will find themselves in many different environments like DuckDB, Postgres, psql, DBeaver, ... with wildly different levels of IDE support.
Here's a very specific example from the code:
in prqlc/ir/pl/expr.rs we have:
...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
Ident(Ident),
All {
within: Box<Expr>,
except: Box<Expr>,
},
Literal(Literal),
Tuple(Vec<Expr>),
...
A question exists on how to properly reference the enum Literal
here, ambiguously as the second one in the line, within the parenthesis of: Literal (Literal)
This enum is located in prqlc-parser/src/lexer/lr/token.rs and it is used by EVERY AST.
Certainly, we all agree that there should be a module qualified-name. But there are two options to go about doing this. 1) prqlc/ir/pl/expr.rs
...
use prqlc_parser::lexer::lr;
...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
Ident(Ident),
All {
within: Box<Expr>,
except: Box<Expr>,
},
Literal(lr::Literal),
Tuple(Vec<Expr>),
...
2) prqlc-parser/parser/pr/mod.rs
pub use lexer::lr::Literal;
prqlc/ir/pl/expr.rs
use prqlc_parser::parser::pr;
...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
Ident(Ident),
All {
within: Box<Expr>,
except: Box<Expr>,
},
Literal(pr::Literal),
Tuple(Vec<Expr>),
...
Recall that the flow of ast objects is LR -> PR -> PL -> RQ -> PQ. This style would be continued such that pq::Literal
exists, but of course, indirectly, is referencing the code within the Lexer
This method means that if you wish to import PRQL
and use, for example, the PL AST in your code, you would only have to write use prqlc::ir::pl
and you'd have access to all of the ast objects, like pl::Expr
, pl::Literal
, etc.
Downside: following where pq::Literal
exists can be a pain, especially in the GitHub IDE. This kinda obfuscates that it's being passed around multiple times.
Of course, we could instead:
3) Redefine Literal in each AST...but I don't think anyone would support that level of code duplication
Unrelatedly, (but also important) can we rename either of the enums within Literal(Literal)
? This is super confusing during refactors even while paying very close attention, since, within prqlc-parser/src/lexer/lr/token.rs, there also exists:
#[derive(Clone, PartialEq, Debug, Serialize, Deserialize)]
pub enum TokenKind {
...
Literal(Literal),
...
Two enums, same file, same name. Yuck! Literally confusing 😊
If I understand correctly, the issue is that:
Literal
in LRpl::Literal
? Or lr::Literal
?Good question. I'm not sure and I don't have enough experience with this to come to an informed view. I would probably hesitate before rolling out one sweeping choice unless anyone has a strong view. (Whereas for something like Expr
we could roll out a big change whenever)
One option would be: if we only have one Literal
throughout the code, then we don't need a prefix. Possibly that becomes difficult to oversee though (and there's a case that doesn't address: if we multiple similarly-named structs, each for multiple stages a different Literal
for others...)
Literal(Literal),
My prior was it's quite reasonable to name an Enum variant the same as another Enum / Struct (and is fairly common?). I don't think they can be semantically ambiguous. Though possibly they can be lexically ambiguous if someone is doing text replacements?
I would probably hesitate before rolling out one sweeping choice unless anyone has a strong view.
Well, the reason I ask this is because we currently use a confusing mix of the two styles. And there isn't exactly only one Literal
in total, since ExprKind::Literal
is also technically a Literal
For the AST modules, I've added this exporting because there will never be name conflicts and we don't want to be writing
pl::expr::Expr
, but justpl::Expr
, which is clear enough.For compiler modules, I try design a narrow complexity-bottleneck, which in practice means "export as little as possible, ideally a single function".
For other modules, do as little of re-exporting as possible, because that effectively "moves" a declaration so when referring to it looks like it at a different location than it really is.
I think this is excellent!
Our current code is less strict than this though. For example, in the one case linked above, this removes some ambiguous exports by removing the pub mod
:
And there isn't exactly only one
Literal
in total, sinceExprKind::Literal
is also technically aLiteral
...but...
My prior was it's quite reasonable to name an Enum variant the same as another Enum / Struct (and is fairly common?). I don't think they can be semantically ambiguous. Though possibly they can be lexically ambiguous if someone is doing text replacements?
I'm torn between options 1 and 2. Option 3 (duplicating definitions) is unnecessary in this case, but could be used in the future, if the is need to expand/shrink literals trough the stages.
A case for option 1 is (as I wrote above):
A case for option 2 is:
pr::ExprKind::Literal(lr::Literal::Null)
one must know to pull one node out of pr
and one from lr
,use crate::pr
and no other prior stages,I seems I lean toward option 2 more than 1. Ultimately, this is easy to refactor, so we can change this decision later.
I think I'm also in favor of option 2, for most (all) of @aljazerzen's reasons.
To reduce the confusion concern, it makes sense to drop a README or some comments that generally explain the reasoning... maybe mostly as an apology to anyone who grumbles about chasing references 🙂 It would be ideal to have the documentation be as detailed as a map ("want to find Literal? Look here...") but that's more work than necessary especially as the internals are still in flux.
Ultimately, this is easy to refactor, so we can change this decision later.
(Yes. In these sorts of decisions I would tend towards starting small and letting it soak for a bit, rather than one big refactor that we may oscillate on)
The code is a mix of option 1 and 2 for now, so we can see how we feel as it grows. With #4659 merged, I'm closing this issue - I think everything has been addressed
This is also merged into feat-types (or rather feat-indirection) and I can confirm it was an overall improvement.
What's up?
As discussed with @aljazerzen & @m-span (though not necessarily agreed yet, feedback welcome),
prqlc-ast
isn't quite pulling its weight as a "cross-compiler external interface".I'm very keen to continue to make the compiler more modular by stage. But this crate doesn't do that — instead it adds context to understanding a particular stage.
So I think it makes sense to:
prqlc-parser
the items which only haveprqlc-parser
as a dependent. For examplespan.rs
(ortoken.rs
, which I recently combined with similar code)prqlc
also depends onSpan
— but only because it's used inprqlc-parser
, I'm counting that as "only haveprqlc-parser
as a dependent", hope that's not too confusing...)generic.rs
which lots of crates depend on — we could consider also dissolving that. But we can wait to see what's leftKeen not to just continually refactor the library (meme below), but in this case I do think it would make it simpler and is worthwhile.