Closed max-sixty closed 1 year ago
For the actual syntax, if we do need it to be explicit: one option to consider is join [=common_col]
which has a slightly more semantic meaning, but could also just look weird – not sure!
This is a great write-up on the issue.
On scale from -2 to +2, my stance on implicit behavior is +1. I'm not concerned with the ambiguity, because using a bool column in join's filter does not make sense. I also agree that having an operator just for joins adds unneeded complexity to the syntax of the language.
The problem with implicit behavior is the implementation -- I cannot figure out how to set the rules on when to apply the conversion from x
to left.x == right.x
.
Imagine we have to write down the implementation of join
in python. It would be something like:
def join(left_table: Table, right_table: Table, join_filter: Column[bool]):
cross = cartesian_product(left_table, right_table)
return filter(cross, join_filter)
(Let's imagine that cartesian_product()
and filter()
are existing functions and we don't care about their implementation.)
How could you check that join_filter
is just an ident?
You essentially cannot, because from within the function, you only see the value of the param and not its expression. You could check the type of the column, but in our case, this information will not be available at-compile time in most cases.
If we add arguments to join, we solved the issue partially:
def join(left_table: Table, right_table: Table, join_filter: Column[bool], using: Column):
cross = cartesian_product(left_table, right_table)
join_filter.append(
left_table[using] == right_table[using]
)
return filter(cross, join_filter)
We can differentiate between boolean filters and using columns. But the problem now is that argument using
is a specific column (a vector of values). It is not a name we could lookup in left_table
and right_table
.
For such approach to work, we would need to have lookup of the column (left_table[using]
) be outside of the function call:
def join(left_table: Table, right_table: Table, using_left: Column, using_right: Column):
cross = cartesian_product(left_table, right_table)
join_filter = [using_left == using_right]
return filter(cross, join_filter)
join(employees, salaries, employees.emp_id, salaries.emp_id)
But this defeats the whole purpose of common operations being concise.
I hope I made the problem clear, please do ask if somethings needs more explanation. I really want someone else than me to see the problem, because it may be that I'm looking at it from the wrong angle.
(I'm using these examples to highlight how semantics of PRQL behave with the new merge - they are much closer to consistent semantics of general programming languages like Python or Rust and further from SQL which defines different semantics for different constructs like USING.)
For the actual syntax, if we do need it to be explicit: one option to consider is
join [=common_col]
which has a slightly more semantic meaning, but could also just look weird – not sure!
I like this very much! It may clash with assigns, and join [==common_col]
may be a little better on that front.
I don't understand the issue here well enough to make an informed comment. Instead I'm going to advise you to consider avoiding the roff
problem. (Likely an apocryphal story, but the essence remains true...)
roff
was a text formatting program in the early days of Unix. Only after the patent department at Bell Labs embraced it did it catch on.
But... (This is the apocryphal part...) They realized that they had screwed up the command structure, but they had "dozens of users" so they were reluctant to change it. So it inflicted that pain on generations of users.
It seems we're in the "dozens of users" stage. Break things if we must... Thanks.
As @snth suggested, there is an option of having arguments that don't get resolved at all (either via s-strings or noresolve.
argument).
This is actually a viable option:
# std declaration
- func join<table> `default_db.with`<table> filter `noresolve.side`:inner tbl<table> -> null
+ func join<table> `default_db.with`<table> filter `noresolve.using`:null `noresolve.side`:inner tbl<table> -> null
from e=employees
join salaries using:id
join l=locations [e.office_address == l.address]
The downside is that we give up name resolution and table inference, which may cause errors in some cases, where CTEs are created.
join salaries using:id
This also violates the current rules on function parameters, because positional params are required: https://prql-lang.org/book/queries/functions.html
You essentially cannot, because from within the function, you only see the value of the param and not its expression. You could check the type of the column, but in our case, this information will not be available at-compile time in most cases.
I think this is a key point. For the implicit mode to work, we need to be able to inspect the contents of the expression, and given we don't materialize the expression, that's not possible.
It's a corollary of the "semantic simplicity" point above — the model of encapsulated expressions is in the compiler as well as the language.
So that does weigh me towards going Explicit.
I like the join foo [==bar]
syntax!
This is a fast moving discussion! (Sorry, I've been putting my comments on Discord and forgot about this issue.)
What I was suggesting is generalising the ~
operator from a "self-equality" operator to a "noresolve" operator (since the cat's out the bag on that one) because it seems the latter might also be useful in other contexts.
I was asking @aljazerzen if the join
transform could then detect that the filter
argument is of the noresolve
type and then switch to using the USING
syntax for that.
So the syntax would still be as originally be proposed by @aljazerzen but it would have slightly more general semantics.
So
from employees
join salaries [~id]
would still produce
SELECT *
FROM employees
JOIN salaries USING (id)
while
from user_emails
join marketing_campaigns [hasnt_unsubscribed]
would produce
SELECT *
FROM user_emails
JOIN marketing_campaigns
ON hasnt_unsubscribed
With the noresolve
~
operator, the following example
func json_get json_col key `noresolve.type` -> s"({json_col}->>{key})::{type}"
from sometable
derive [
id = (json_get "abc" "id" text),
]
could then be rewritten as
func json_get json_col key ~type -> s"({json_col}->>{key})::{type}"
from sometable
derive [
id = (json_get "abc" "id" text),
]
This use of ~
is also somewhat reminiscent of the ~
formula syntax in R:
https://www.datacamp.com/tutorial/r-formula-tutorial
Something that characterizes formulas in R is the tilde operator ~. With this operator, you can actually say: "capture the meaning of this code, without evaluating it right away”. That also explains why you can think of a formula in R as a "quoting" operator.
Another thing that came up on the dev call yesterday, which I think we should bring into this discussion, is how strict the name resolution should be.
The two viewpoints as I understood them were:
It was mentioned that the burden of the strict name resolution option could be lessened through tooling, which at authoring time could suggest the ambiguous parent object names as possible autocompletions.
I am in favour of option 1, strict name resolution, especially in light of the possibility of a noresolve ~
operator which would provide (yet another) escape hatch so that users could choose to forgo the strict name resolution should they wish or require this. At least then it is a user decision to defer this check to runtime and they may be comfortable with this given their knowledge of the data(base) structure.
The JOIN ... USING
syntax may be the exception that proves the rule because the alternative to
from employees
join salaries [~id]
would be the strict mode equivalent of
from employees
join salaries [employees.id==salaries.id]
In the first case the user takes responsibility that id
is present in both the employees
and salaries
tables.
Another thing ... is how strict the name resolution should be.
I think this should be a separate issue.
I was asking @aljazerzen if the join transform could then detect that the filter argument is of the noresolve type ...
Unfortunately, no. For the same reasons that we cannot change behavior of join upon detecting if column has boolean or some other type.
If I understand correctly, this noresolve ~
operator would essentially act as s-string?
I'm familiar with ~ in R, but I never understood how does it work. My current understanding is that functions in R take arguments by expression and not by value (as they do in all other languages I know). This allows R to unpack the input expression and treat it as a formula.
For me, this is one of magic things that R does that I really don't want in PRQL. That's because in simple cases, magic helps a little and in complex cases there is no way to guess what the magic will do.
Invoking DoOcracy, I'll change "self equality operator" syntax to ==
and add some docs about it, so this doesn't block 0.3
But let's keep this issue open for a few weeks to get more feedback.
Agree re separate issue for strictness.
And great re DoOcracy (I hadn't heard of that but very much subscribe!)
And, not to agree with everything in that comment :) , but also re ~
— isn't this just an s-string? (Possibly that's worthy of a separate issue if we're discussing it more broadly than the join syntax). I do think there are legit discussions around how we resolve things, including strictness and ambiguity — it's worth reviewing https://prql-lang.org/book/internals/name-resolving.html for some context on the current impl.
Unfortunately, no. For the same reasons that we cannot change behavior of join upon detecting if column has boolean or some other type.
Ok, I thought that one difference might be that the column type would depend on the database and would only be knowable at runtime whereas a distinction between ident and "unresolved ident" if specified in the PRQL code could be known at compile time and could therefore be used to branch the behaviour.
I was just operating on the reply to my comment on Discord that
This would work but we give up name resolution. Which means that you don't get errors at compile time
but I see from your description of my comment above that we had different understandings of what I meant.
If I understand correctly, this noresolve ~ operator would essentially act as s-string?
Probably, yes. I was proposing it as essentially syntactic sugar for the noresolve.
prefix that was described in the Discord comment below:
The rule here is that if a param name is prefixed with noresolve. (which gets removed), param will not be resolved and idents will be translated as-is, into s-strings
However rereading that comment again, my understanding now is that that only referred to params and I was kinda taking it out of context and applying it more generally. So to clarify, noresolve.
is a way in a function definition to specify that a specific parameter will not be resolved and turned into an s-string?
Thanks for the name-resolving PRQL Book link.
So to clarify, noresolve. is a way in a function definition to specify that a specific parameter will not be resolved and turned into an s-string?
Yes, exactly.
For example join side:left
has to be noresolve.
so left is not interpreted as an ident referring to a column or something else.
But again, this is implementation detail and I don't want it to become a part of the language, because we may once declare left
to be somekind of an enum in std
and noresolve.
will not be needed anymore.
I think we can close this? If there are any docs to write, feel free to assign them to me @aljazerzen
I've already added a section in the PR: https://prql-lang.org/book/transforms/join.html
Take a look and expand if needed.
Ah yes, I already did a couple of tweaks in https://github.com/prql/prql/commit/39f2fb26c0e71274ad2f327d4d61492d86c03d86, I forgot
I was planning to release 0.3 as discussed, with the change in
join
such that supplying columns to an equi-join requires an additional character to make explicit columns vs. conditions:This is because a bare
id
is treated as a bool condition, likee.office_address == l.address
is treated. (and in theory,id
could be a bool column)I'm still fine to do the release and assess feedback, as discussed. But I wanted to raise whether we should:
main
, rather than big user-facing changes in a version number.On the specific language change, I see it as a tradeoff between syntactic simplicity and semantic simplicity:
USING
/ an equi-join / a join between identically named columns] with~
"? (Maybe it feels bigger to us because it's a change and it's so prominent in our docs and examples?)Semantic simplicity / generality — having
[id]
mean something different from[id==true]
in ajoin
breaks the encapsulation of the expression. The compiler needs to understand what's inside the expression; uncorrelated concepts become correlated, the language becomes less general and less orthogonal.bar
a condition or a column in the join? I guess it's a condition because we know it'sbaz==bax
. But ifbar
were materialized in a column in the DB, then the behavior suddendly changes.If anyone has ideas for an alternative representation rather than
~
, then feel free to suggest! Though I actually think that~
is pretty good.One alternative would be to have a different parameter; e.g.
using:[id]
, but then given the conditions parameter would still be required, we'd have an awkwardjoin locations using:[id] []
.If we do go the explicit route, is there something we can do to make this clearer for users? I would find this quite confusing if I weren't watching the releases and all of a sudden this compiles to something completely different:
PRQL has a higher ratio of expectations&excitement vs. users than most projects, so it's fine to make breaking changes atm. But this is potentially quite severe. Assuming we go the explicit route, should we raise an error for a bare column name for a few versions so it's at least obvious when people do this?
Without wanting to zoom out too far, possibly it's worth considering this in the context of overall joins; e.g. #716 & #723
Where do folks end up? As I said prior, @aljazerzen has full rights to respond with 😫, and I'll do the release.
semantic
was really Herculean, and we're still young enough that we probably underrate velocity.For transparency, if we do decide to make the change, I'm flat-out with non-PRQL stuff until mid-week, after which I have more time and would be happy to work on this. I'm quite excited to get into working with the new compiler!
[^1]: "small" here means both in character-count and syntactic complexity, in this case
~
is small in character-count but adds to syntactic complexity. For theory around compression, check out source-coding, and I can find better references if folks are interested