PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.84k stars 216 forks source link

Gradually move `prlqc-ast` code into other crates #4603

Closed max-sixty closed 3 months ago

max-sixty commented 3 months ago

What's up?

As discussed with @aljazerzen & @m-span (though not necessarily agreed yet, feedback welcome), prqlc-ast isn't quite pulling its weight as a "cross-compiler external interface".

I'm very keen to continue to make the compiler more modular by stage. But this crate doesn't do that — instead it adds context to understanding a particular stage.

So I think it makes sense to:

Keen not to just continually refactor the library (meme below), but in this case I do think it would make it simpler and is worthwhile.

m-span commented 3 months ago

I have some ideas on how to handle this, but as I understand, @aljazerzen is in the middle of a refactor. I've generally found it bad practice to have more than one refactor being worked on at the same time, as the second to be merged in will have a hell of a bad time with it. Ironically, if the organization @max-sixty proposes is in place, this wouldn't be much of an issue.

@aljazerzen what is the state of your refactor, and how far does it encompass? Can you give me an idea of what you are/are not touching so I can respect the work you've done so far?

m-span commented 3 months ago

Another related question:

            PRQL

   (parser) │ ▲
 prql_to_pl │ │ pl_to_prql
            │ │
            ▼ │      json::from_pl
                   ────────►
           PL AST            PL JSON
                   ◄────────
            │        json::to_pl
            │
 (resolver) │
   pl_to_rq │
            │
            │
            ▼        json::from_rq
                   ────────►
           RQ AST            RQ JSON
                   ◄────────
            │        json::to_rq
            │ 
  rq_to_sql │ (SQL compiler)
            ▼

            SQL

This diagram should be accurate, but isn't quite up to date, since PL and AST aren't exactly the same, and prql_to_pl technically outputs in prqlc-ast format. Part of the confusion is the different terminology. Should the "parser" be returning PL instead of (parser) AST? If so, should the PL objects be moved to the parser as well?

The end result might be that everything called by prql_to_pl (and possibly pl_to_prql too) exists within the "parser"

m-span commented 3 months ago

Right now it looks like the resolver is actually converting from AST to PL, then to RQ.

aljazerzen commented 3 months ago

I'm in favor of this change - a separate crate is unnecessary and tedious to use.

It will be painful to merge this into my branch, but let's not take that as an excuse to put off improvements on the master branch. In the main prqlc crate I've added use prqlc_ast as ast into the main lib.rs, so everywhere else, I (am trying to) have ast imported as use crate::ast, which would make merging this much easier.

One thing I would ask, is to do this in separate commits (or PRs, just that the master branch contains separate commits in the end).

(PS: my branch is now closer to a complete rewrite of semantic module than a refactor. To merge that, we will have to make changes to the language itself)

The diagram in the lib should be updated, I agree, but we need to rethink how we see the compiler stages.

aljazerzen commented 3 months ago

Also, a bit of terminology decision: is PL an Abstract Syntax Tree? is RQ an Abstract Syntax Tree?

When the prql-ast was split into separate crate, it assumed that PL and RQ are IRs, not an AST. I think that the prqlc-ast, PL and RQ should all be named ASTs, because it is nice to have a name that covers all of these representations.

If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.


Something that would probably help with code readability would be to use module-qualified names of AST nodes:

- use crate::ir::pl::Expr;
-
- fn print(e: Expr) -> ...

+ use crate::ir::pl;
+
+ fn print(e: pl::Expr) -> ...

This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs pl::InterpolationItem, you probably don't need to add a new import, because pl is probably already imported.

aljazerzen commented 3 months ago

These are stages of the compiler as I see them now:

stage sub-stage AST comment
parse lexer string -> tokens
parse parser tokens -> prqlc-ast PT
semantic ast_expand PL bad name, this could be called "desugar"
semantic resolver PL this does a lot of stuff, my rewrites splits it up
semantic flatten PL
semantic lowering PL -> RQ
sql preprocess RQ
sql srq compiler RQ -> SRQ
sql postprocess SRQ
sql SQL compiler SRQ -> sqlparser::ast
sql codegen sqlparser::ast -> string
kgutwin commented 3 months ago

Something that would probably help with code readability would be to use module-qualified names of AST nodes:

- use crate::ir::pl::Expr;
-
- fn print(e: Expr) -> ...

+ use crate::ir::pl;
+
+ fn print(e: pl::Expr) -> ...

This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs pl::InterpolationItem, you probably don't need to add a new import, because pl is probably already imported.

This would be very nice... as a relatively new contributor, I spent so much time tracking down "is it this Expr or that Expr?" The duplication of names made searching the codebase much harder.

You can even see this in the GitHub interface - just find any instance of the word "Expr" and click on it, it shows six different places where you might need to check for its implementation!

image
m-span commented 3 months ago

What is the tradeoff on struct naming conventions for:

ir::pl::Expr

ir::rq::Expr

vs

ir::pl::PlExpr

ir::rq::RqExpr

?

Do we see one as much highly preferred than the other?

aljazerzen commented 3 months ago

I dislike the second option a lot because:

kgutwin commented 3 months ago

I know naming conventions are practically the definition of bikeshedding... but again, as a new contributor -- I favor things like ir::pl::PlExpr, simply because it provides better signposting when you find yourself in the middle of an unfamiliar codebase and need to get your bearings. Frequently I found that it isn't obvious at first glance what compiler phase you're in if you've been given a random line of code that's throwing a compiler panic. To be honest, I still feel that way sometimes when I'm wandering around the codebase. It would make it much easier if it was a bit more explicit that "this code is working on PL".

Unfortunately it appears that GitHub's symbol explorer isn't clever enough to look at the use statements to figure out which Expr is in scope for a given module definition. I mostly mention that because as someone who was considering contributing, it was important for me to try to get my bearings in the codebase before I went to the trouble of firing up a more sophisticated code editor. It might seem like a weird reason, but simple tools like the GitHub web interface or grep would be a lot easier to use if we went with PlExpr.

aljazerzen commented 3 months ago

And what do you think about that using pl::Expr instead of plain Expr?

m-span commented 3 months ago

Unfortunately it appears that GitHub's symbol explorer isn't clever enough to look at the use statements to figure out which Expr is in scope for a given module definition.

This feels like a good moment to mention that searching by exact symbol is a feature of a full-fledged "smart" IDE like RustRover:

image
kgutwin commented 3 months ago

And what do you think about that using pl::Expr instead of plain Expr?

Definitely a good idea! I don't know if it would help the GitHub interface be smarter (I'm guessing not) but it would still be helpful to newcomers.

max-sixty commented 3 months ago

And what do you think about that using pl::Expr instead of plain Expr?

Definitely — I did a very minor version of this, as an example, in https://github.com/PRQL/prql/pull/4398

aljazerzen commented 3 months ago

@kgutwin Yeah, github interface is not on par with rust-analyzer of rust-rover.


Ok, deal. That can be a general rule from now on: each reference to any AST type should be module-qualified.


@max-sixty Are we set on the name PT for the prqlc-ast/early AST/parser AST?

If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.

max-sixty commented 3 months ago

If we go with this, we need to come up with a nice name for the prqlc-ast. We floated a few ideas before, we just need to pick one. I suggest one of "Parser Syntax Tree" / "Parser Tree", with abbreviations PST / PT.

Yes!

Though a clever-but-maybe-confusing approach would be to keep combine characters from PRQL for the names — so we could use "PR" for "Parser Representation". PR->PL->RQ->...

(Tokens could be "LR" for "Lexer Representation"!)

aljazerzen commented 3 months ago

:D The last one would be QL?

These are all available names. We can have up to 6 IRs:

PRQL
--   PR
- -  PQ
-  - PL
 --  RQ
 - - RL
  -- QL
max-sixty commented 3 months ago

Depends if we allow permutations or combinations!

We get 12 of we allow permutations such as LR!

I actually think would be fun to do — iff we can backronym them. Lexer Representation / Parser Language / Qualified Representation / etc!

max-sixty commented 3 months ago

OK if this works it would be fun.

~(Maybe it doesn't work, feel free to pull the rip cord...)~

~How about:~

Edited based on discussion below...

stage sub-stage AST comment
parse lexer string -> ~tokens~ LR — Lexer Representation
parse parser tokens -> PR — Parser Representation
semantic ast_expand PL — Pipelined Language
semantic resolver PL
semantic flatten PL
semantic lowering PL -> RQ — Resolved Query
sql preprocess RQ
sql srq compiler RQ -> ~SRQ~ PQ — Partitioned Query
sql postprocess SRQ
sql SQL compiler SRQ -> sqlparser::ast
sql codegen sqlparser::ast -> string
aljazerzen commented 3 months ago

I like the LR and PR, names confirmed.

PL stands for "Pipelined Language" since this representation does not make much assumptions about relations. RQ stands for "Relational Query", but "Resolved Query" makes sense too.

I'm not so sure about LQ - arguably RQ is more linear than SRQ. The main point of SRQ is that is has the same query structure as the SQL output: pipelines have been split, subqueries were fixed to some location and set operators have been constructed. Maybe PQ, for "Partitioned Query"?

max-sixty commented 3 months ago

Maybe PQ, for "Partitioned Query"?

Super!

Have edited the comment above

aljazerzen commented 3 months ago

Ok, this is sorted then, we can start refactoring.

I wouldn't mark this as a good first issue, since it involves moving a lot of files and types.

max-sixty commented 3 months ago

I wouldn't mark this as a good first issue, since it involves moving a lot of files and types.

OK

I think we can / should do it in stages.

@m-span if this is something you're interested in, please feel free to start. As above, let's start with the items that we can move directly into prqlc-parser since those have the least ambiguity. No need to wait for a review if you're confident; I am generally quite available for reviews if that's helpful — though much easier for lots of small changes than one huge PR :)

m-span commented 3 months ago

From this discussion, we have the following new terminology:

stage sub-stage AST
parse lexer string -> LR — Lexer Representation
parse parser tokens -> PR — Parser Representation
semantic ast_expand PR -> PL — Pipelined Language
semantic resolver PL
semantic flatten PL
semantic lowering PL -> RQ — Resolved Query
sql preprocess RQ
sql srq-compiler RQ -> PQ — Partitioned Query
sql postprocess PQ
sql sql-compiler PQ -> sqlparser::ast
sql codegen sqlparser::ast -> string

Goals:

aljazerzen commented 3 months ago

(I've edited your table, sql stage not entirely correct)

m-span commented 3 months ago

And with #4634 about to merge, the parser has been refactored 🎉 . Next, it seems to me, is to:

aljazerzen commented 3 months ago

module exporting strategy

What do you mean here? What to export (and re-export) from crates? Or what to make public in modules?

max-sixty commented 3 months ago

What do you mean here? What to export (and re-export) from crates? Or what to make public in modules?

Some of both. At the moment it's fairly scattered — for example https://github.com/PRQL/prql/pull/4634#discussion_r1646810136

aljazerzen commented 3 months ago

I see.

I think I should explain my strategy at this point, but that also means I have to try to formalize thinking.

For the AST modules, I've added this exporting because there will never be name conflicts and we don't want to be writing pl::expr::Expr, but just pl::Expr, which is clear enough.

For compiler modules, I try design a narrow complexity-bottleneck, which in practice means "export as little as possible, ideally a single function".

For other modules, do as little of re-exporting as possible, because that effectively "moves" a declaration so when referring to it looks like it at a different location than it really is.

I hope this is all. I reserve the right to change this for When we find more weird exports I wrote :D

snth commented 3 months ago

Great thread and direction you are moving in with separating out the different compiler stages! Looking forward to getting involved in that myself some time and it looks like that will be much simpler in future.

The following really resonated with me:

Something that would probably help with code readability would be to use module-qualified names of AST nodes:

+ use crate::ir::pl;
+
+ fn print(e: pl::Expr) -> ...

This is what we do at edgedb, and it is a life saver. It also means that when you add a piece of code that needs pl::InterpolationItem, you probably don't need to add a new import, because pl is probably already imported.

I would like to make an appeal that we should follow the same principle for PRQL itself when it comes to the module system and imports. I was also reminded of this by a recent comment on HackerNews:

For anyone designing a programming language, enforce namespace to includes/imports! and if possible, don't allow top level side effects.

let foo = include "lib/foo.hurl"
foo.init()

it's much easier to reason about then, for example:

include "lib/foo.hurl" // side effects
baz(buz) // function and variable that I have no idea if they are in the standard library, or included *somewhere*

That way it's much easier

The lack of this is something that really turned me of Nim which has a lot to like about it otherwise. I think discoverability is a really key aspect of a language. It also aids autocompletion. @aljazerzen and @m-span 's comments above underscore that it also makes it easier when working with the code, particularly when not working in an IDE which I believe is of particular importance for PRQL because many (most?) users will find themselves in many different environments like DuckDB, Postgres, psql, DBeaver, ... with wildly different levels of IDE support.

m-span commented 3 months ago

Here's a very specific example from the code:

in prqlc/ir/pl/expr.rs we have:

...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
    Ident(Ident),
    All {
        within: Box<Expr>,
        except: Box<Expr>,
    },
    Literal(Literal),

    Tuple(Vec<Expr>),

...

A question exists on how to properly reference the enum Literal here, ambiguously as the second one in the line, within the parenthesis of: Literal (Literal)

This enum is located in prqlc-parser/src/lexer/lr/token.rs and it is used by EVERY AST.

Certainly, we all agree that there should be a module qualified-name. But there are two options to go about doing this. 1) prqlc/ir/pl/expr.rs

...
use prqlc_parser::lexer::lr;
...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
    Ident(Ident),
    All {
        within: Box<Expr>,
        except: Box<Expr>,
    },
    Literal(lr::Literal),

    Tuple(Vec<Expr>),

...

2) prqlc-parser/parser/pr/mod.rs

pub use lexer::lr::Literal;

prqlc/ir/pl/expr.rs

use prqlc_parser::parser::pr;
...
#[derive(Debug, EnumAsInner, PartialEq, Clone, Serialize, Deserialize, strum::AsRefStr)]
pub enum ExprKind {
    Ident(Ident),
    All {
        within: Box<Expr>,
        except: Box<Expr>,
    },
    Literal(pr::Literal),

    Tuple(Vec<Expr>),

...

Recall that the flow of ast objects is LR -> PR -> PL -> RQ -> PQ. This style would be continued such that pq::Literal exists, but of course, indirectly, is referencing the code within the Lexer This method means that if you wish to import PRQL and use, for example, the PL AST in your code, you would only have to write use prqlc::ir::pl and you'd have access to all of the ast objects, like pl::Expr, pl::Literal, etc.

Downside: following where pq::Literal exists can be a pain, especially in the GitHub IDE. This kinda obfuscates that it's being passed around multiple times. Of course, we could instead:

3) Redefine Literal in each AST...but I don't think anyone would support that level of code duplication


Unrelatedly, (but also important) can we rename either of the enums within Literal(Literal)? This is super confusing during refactors even while paying very close attention, since, within prqlc-parser/src/lexer/lr/token.rs, there also exists:

#[derive(Clone, PartialEq, Debug, Serialize, Deserialize)]
pub enum TokenKind {
...
    Literal(Literal),
...

Two enums, same file, same name. Yuck! Literally confusing 😊

max-sixty commented 3 months ago

If I understand correctly, the issue is that:

Good question. I'm not sure and I don't have enough experience with this to come to an informed view. I would probably hesitate before rolling out one sweeping choice unless anyone has a strong view. (Whereas for something like Expr we could roll out a big change whenever)

One option would be: if we only have one Literal throughout the code, then we don't need a prefix. Possibly that becomes difficult to oversee though (and there's a case that doesn't address: if we multiple similarly-named structs, each for multiple stages a different Literal for others...)


    Literal(Literal),

My prior was it's quite reasonable to name an Enum variant the same as another Enum / Struct (and is fairly common?). I don't think they can be semantically ambiguous. Though possibly they can be lexically ambiguous if someone is doing text replacements?

m-span commented 3 months ago

I would probably hesitate before rolling out one sweeping choice unless anyone has a strong view.

Well, the reason I ask this is because we currently use a confusing mix of the two styles. And there isn't exactly only one Literal in total, since ExprKind::Literal is also technically a Literal

max-sixty commented 3 months ago

For the AST modules, I've added this exporting because there will never be name conflicts and we don't want to be writing pl::expr::Expr, but just pl::Expr, which is clear enough.

For compiler modules, I try design a narrow complexity-bottleneck, which in practice means "export as little as possible, ideally a single function".

For other modules, do as little of re-exporting as possible, because that effectively "moves" a declaration so when referring to it looks like it at a different location than it really is.

I think this is excellent!

Our current code is less strict than this though. For example, in the one case linked above, this removes some ambiguous exports by removing the pub mod:

```diff diff --git a/prqlc/prqlc-parser/src/parser/expr.rs b/prqlc/prqlc-parser/src/parser/expr.rs index 50547389..f81001e8 100644 --- a/prqlc/prqlc-parser/src/parser/expr.rs +++ b/prqlc/prqlc-parser/src/parser/expr.rs @@ -7,9 +7,9 @@ use crate::error::parse_error::PError; use crate::lexer::lr::{Literal, TokenKind}; use crate::parser::common::{ctrl, ident_part, into_expr, keyword, new_line}; -use crate::parser::pr::ident::Ident; -use crate::parser::pr::ops::{BinOp, UnOp}; +use crate::parser::pr::Ident; use crate::parser::pr::*; +use crate::parser::pr::{BinOp, UnOp}; use crate::parser::types::type_expr; use crate::span::Span; diff --git a/prqlc/prqlc-parser/src/parser/pr/mod.rs b/prqlc/prqlc-parser/src/parser/pr/mod.rs index a0c7c4ec..6ba00bb9 100644 --- a/prqlc/prqlc-parser/src/parser/pr/mod.rs +++ b/prqlc/prqlc-parser/src/parser/pr/mod.rs @@ -2,11 +2,11 @@ // confusion and IIUC serves no purpose — IIUC it guarantees there are two ways // of accessing the same variable -pub mod expr; -pub mod ident; -pub mod ops; -pub mod stmt; -pub mod types; +mod expr; +mod ident; +mod ops; +mod stmt; +mod types; pub use expr::*; pub use ident::*; diff --git a/prqlc/prqlc/src/cli/docs_generator.rs b/prqlc/prqlc/src/cli/docs_generator.rs index 4f8df3ef..09b42e9e 100644 --- a/prqlc/prqlc/src/cli/docs_generator.rs +++ b/prqlc/prqlc/src/cli/docs_generator.rs @@ -1,4 +1,4 @@ -use prqlc::ast::{stmt::StmtKind, ExprKind, Stmt, TyKind, VarDefKind}; +use prqlc::ast::{ExprKind, Stmt, StmtKind, TyKind, VarDefKind}; /// Generate HTML documentation. // pub fn generate_html_docs(stmts: Vec) -> String { diff --git a/prqlc/prqlc/src/cli/mod.rs b/prqlc/prqlc/src/cli/mod.rs index 460ef81a..ada7e62f 100644 --- a/prqlc/prqlc/src/cli/mod.rs +++ b/prqlc/prqlc/src/cli/mod.rs @@ -614,7 +614,7 @@ fn write_output(&mut self, data: &[u8]) -> std::io::Result<()> { } } -fn drop_module_def(stmts: &mut Vec, name: &str) { +fn drop_module_def(stmts: &mut Vec, name: &str) { stmts.retain(|x| x.kind.as_module_def().map_or(true, |m| m.name != name)); } diff --git a/prqlc/prqlc/src/ir/pl/stmt.rs b/prqlc/prqlc/src/ir/pl/stmt.rs index 2eb3f8dd..5b0b2005 100644 --- a/prqlc/prqlc/src/ir/pl/stmt.rs +++ b/prqlc/prqlc/src/ir/pl/stmt.rs @@ -2,8 +2,8 @@ use serde::{Deserialize, Serialize}; use super::expr::Expr; -pub use crate::ast::stmt::QueryDef; use crate::ast::Ident; +pub use crate::ast::QueryDef; use crate::ast::{Span, Ty}; #[derive(Debug, Clone, PartialEq, Serialize, Deserialize)] diff --git a/prqlc/prqlc/src/semantic/eval.rs b/prqlc/prqlc/src/semantic/eval.rs index 8c313785..9277fe31 100644 --- a/prqlc/prqlc/src/semantic/eval.rs +++ b/prqlc/prqlc/src/semantic/eval.rs @@ -7,7 +7,7 @@ use crate::ir::pl::{Expr, ExprKind, Func, FuncParam, Ident, PlFold}; use crate::{Error, Result, Span, WithErrorInfo}; -pub fn eval(expr: crate::ast::expr::Expr) -> Result { +pub fn eval(expr: crate::ast::Expr) -> Result { let expr = ast_expand::expand_expr(expr)?; Evaluator::new().fold_expr(expr) ```
max-sixty commented 3 months ago

And there isn't exactly only one Literal in total, since ExprKind::Literal is also technically a Literal

...but...

My prior was it's quite reasonable to name an Enum variant the same as another Enum / Struct (and is fairly common?). I don't think they can be semantically ambiguous. Though possibly they can be lexically ambiguous if someone is doing text replacements?

aljazerzen commented 3 months ago

I'm torn between options 1 and 2. Option 3 (duplicating definitions) is unnecessary in this case, but could be used in the future, if the is need to expand/shrink literals trough the stages.

A case for option 1 is (as I wrote above):

A case for option 2 is:

I seems I lean toward option 2 more than 1. Ultimately, this is easy to refactor, so we can change this decision later.

kgutwin commented 3 months ago

I think I'm also in favor of option 2, for most (all) of @aljazerzen's reasons.

To reduce the confusion concern, it makes sense to drop a README or some comments that generally explain the reasoning... maybe mostly as an apology to anyone who grumbles about chasing references 🙂 It would be ideal to have the documentation be as detailed as a map ("want to find Literal? Look here...") but that's more work than necessary especially as the internals are still in flux.

max-sixty commented 3 months ago

Ultimately, this is easy to refactor, so we can change this decision later.

(Yes. In these sorts of decisions I would tend towards starting small and letting it soak for a bit, rather than one big refactor that we may oscillate on)

m-span commented 3 months ago

The code is a mix of option 1 and 2 for now, so we can see how we feel as it grows. With #4659 merged, I'm closing this issue - I think everything has been addressed

aljazerzen commented 3 months ago

This is also merged into feat-types (or rather feat-indirection) and I can confirm it was an overall improvement.