PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.81k stars 214 forks source link

Implicit vs Explicit column arguments to `join` #1168

Closed max-sixty closed 1 year ago

max-sixty commented 1 year ago

I was planning to release 0.3 as discussed, with the change in join such that supplying columns to an equi-join requires an additional character to make explicit columns vs. conditions:

from e=employees
-join salaries [id]   # implicit
+join salaries [~id]  # explicit
join l=locations [e.office_address == l.address]

This is because a bare id is treated as a bool condition, like e.office_address == l.address is treated. (and in theory, id could be a bool column)

I'm still fine to do the release and assess feedback, as discussed. But I wanted to raise whether we should:


On the specific language change, I see it as a tradeoff between syntactic simplicity and semantic simplicity:


If anyone has ideas for an alternative representation rather than ~, then feel free to suggest! Though I actually think that ~ is pretty good.

One alternative would be to have a different parameter; e.g. using:[id], but then given the conditions parameter would still be required, we'd have an awkward join locations using:[id] [].


If we do go the explicit route, is there something we can do to make this clearer for users? I would find this quite confusing if I weren't watching the releases and all of a sudden this compiles to something completely different:

from foo
join x [bar]
 SELECT
   foo.*,
-  x.*,
-  bar
+  x.*
 FROM
   foo
-  JOIN x USING(bar)
+  JOIN x ON bar

PRQL has a higher ratio of expectations&excitement vs. users than most projects, so it's fine to make breaking changes atm. But this is potentially quite severe. Assuming we go the explicit route, should we raise an error for a bare column name for a few versions so it's at least obvious when people do this?


Without wanting to zoom out too far, possibly it's worth considering this in the context of overall joins; e.g. #716 & #723


Where do folks end up? As I said prior, @aljazerzen has full rights to respond with 😫, and I'll do the release. semantic was really Herculean, and we're still young enough that we probably underrate velocity.

For transparency, if we do decide to make the change, I'm flat-out with non-PRQL stuff until mid-week, after which I have more time and would be happy to work on this. I'm quite excited to get into working with the new compiler!

[^1]: "small" here means both in character-count and syntactic complexity, in this case ~ is small in character-count but adds to syntactic complexity. For theory around compression, check out source-coding, and I can find better references if folks are interested

mklopets commented 1 year ago

For the actual syntax, if we do need it to be explicit: one option to consider is join [=common_col] which has a slightly more semantic meaning, but could also just look weird – not sure!

aljazerzen commented 1 year ago

This is a great write-up on the issue.

On scale from -2 to +2, my stance on implicit behavior is +1. I'm not concerned with the ambiguity, because using a bool column in join's filter does not make sense. I also agree that having an operator just for joins adds unneeded complexity to the syntax of the language.

The problem with implicit behavior is the implementation -- I cannot figure out how to set the rules on when to apply the conversion from x to left.x == right.x.

Imagine we have to write down the implementation of join in python. It would be something like:

def join(left_table: Table, right_table: Table, join_filter: Column[bool]):
  cross = cartesian_product(left_table, right_table)
  return filter(cross, join_filter)

(Let's imagine that cartesian_product() and filter() are existing functions and we don't care about their implementation.)

How could you check that join_filter is just an ident?

You essentially cannot, because from within the function, you only see the value of the param and not its expression. You could check the type of the column, but in our case, this information will not be available at-compile time in most cases.


If we add arguments to join, we solved the issue partially:

def join(left_table: Table, right_table: Table, join_filter: Column[bool], using: Column):
  cross = cartesian_product(left_table, right_table)
  join_filter.append(
    left_table[using] == right_table[using]
  )
  return filter(cross, join_filter)

We can differentiate between boolean filters and using columns. But the problem now is that argument using is a specific column (a vector of values). It is not a name we could lookup in left_table and right_table.


For such approach to work, we would need to have lookup of the column (left_table[using]) be outside of the function call:

def join(left_table: Table, right_table: Table, using_left: Column, using_right: Column):
  cross = cartesian_product(left_table, right_table)
  join_filter = [using_left == using_right]
  return filter(cross, join_filter)

join(employees, salaries, employees.emp_id, salaries.emp_id)

But this defeats the whole purpose of common operations being concise.


I hope I made the problem clear, please do ask if somethings needs more explanation. I really want someone else than me to see the problem, because it may be that I'm looking at it from the wrong angle.

(I'm using these examples to highlight how semantics of PRQL behave with the new merge - they are much closer to consistent semantics of general programming languages like Python or Rust and further from SQL which defines different semantics for different constructs like USING.)

aljazerzen commented 1 year ago

For the actual syntax, if we do need it to be explicit: one option to consider is join [=common_col] which has a slightly more semantic meaning, but could also just look weird – not sure!

I like this very much! It may clash with assigns, and join [==common_col] may be a little better on that front.

richb-hanover commented 1 year ago

I don't understand the issue here well enough to make an informed comment. Instead I'm going to advise you to consider avoiding the roff problem. (Likely an apocryphal story, but the essence remains true...)

roff was a text formatting program in the early days of Unix. Only after the patent department at Bell Labs embraced it did it catch on.

But... (This is the apocryphal part...) They realized that they had screwed up the command structure, but they had "dozens of users" so they were reluctant to change it. So it inflicted that pain on generations of users.

It seems we're in the "dozens of users" stage. Break things if we must... Thanks.

aljazerzen commented 1 year ago

As @snth suggested, there is an option of having arguments that don't get resolved at all (either via s-strings or noresolve. argument).

This is actually a viable option:

  # std declaration
- func join<table> `default_db.with`<table> filter `noresolve.side`:inner tbl<table> -> null
+ func join<table> `default_db.with`<table> filter `noresolve.using`:null `noresolve.side`:inner tbl<table> -> null
from e=employees
join salaries using:id
join l=locations [e.office_address == l.address]

The downside is that we give up name resolution and table inference, which may cause errors in some cases, where CTEs are created.

max-sixty commented 1 year ago
join salaries using:id

This also violates the current rules on function parameters, because positional params are required: https://prql-lang.org/book/queries/functions.html

max-sixty commented 1 year ago

You essentially cannot, because from within the function, you only see the value of the param and not its expression. You could check the type of the column, but in our case, this information will not be available at-compile time in most cases.

I think this is a key point. For the implicit mode to work, we need to be able to inspect the contents of the expression, and given we don't materialize the expression, that's not possible.

It's a corollary of the "semantic simplicity" point above — the model of encapsulated expressions is in the compiler as well as the language.

So that does weigh me towards going Explicit.

I like the join foo [==bar] syntax!

snth commented 1 year ago

This is a fast moving discussion! (Sorry, I've been putting my comments on Discord and forgot about this issue.)

What I was suggesting is generalising the ~ operator from a "self-equality" operator to a "noresolve" operator (since the cat's out the bag on that one) because it seems the latter might also be useful in other contexts.

I was asking @aljazerzen if the join transform could then detect that the filter argument is of the noresolve type and then switch to using the USING syntax for that.

So the syntax would still be as originally be proposed by @aljazerzen but it would have slightly more general semantics.

So

from employees
join salaries [~id]

would still produce

SELECT *
FROM employees
JOIN salaries USING (id)

while

from user_emails
join marketing_campaigns [hasnt_unsubscribed]

would produce

SELECT *
FROM user_emails
JOIN marketing_campaigns
    ON hasnt_unsubscribed
snth commented 1 year ago

With the noresolve ~ operator, the following example

func json_get json_col key `noresolve.type` -> s"({json_col}->>{key})::{type}"

from sometable
derive [
  id = (json_get "abc" "id" text),
]

could then be rewritten as

func json_get json_col key ~type -> s"({json_col}->>{key})::{type}"

from sometable
derive [
  id = (json_get "abc" "id" text),
]
snth commented 1 year ago

This use of ~ is also somewhat reminiscent of the ~ formula syntax in R:

https://www.datacamp.com/tutorial/r-formula-tutorial

Something that characterizes formulas in R is the tilde operator ~. With this operator, you can actually say: "capture the meaning of this code, without evaluating it right away”. That also explains why you can think of a formula in R as a "quoting" operator.

snth commented 1 year ago

Another thing that came up on the dev call yesterday, which I think we should bring into this discussion, is how strict the name resolution should be.

The two viewpoints as I understood them were:

  1. Ambiguous name references, like ambiguous column references, should throw an error straight away at compile time, forcing the user to specify which parent table the column references.
  2. Ambiguity is ok and we simply produce the SQL and we let the underlying database/RDBMs resolve the ambiguity at runtime.

It was mentioned that the burden of the strict name resolution option could be lessened through tooling, which at authoring time could suggest the ambiguous parent object names as possible autocompletions.

snth commented 1 year ago

I am in favour of option 1, strict name resolution, especially in light of the possibility of a noresolve ~ operator which would provide (yet another) escape hatch so that users could choose to forgo the strict name resolution should they wish or require this. At least then it is a user decision to defer this check to runtime and they may be comfortable with this given their knowledge of the data(base) structure.

The JOIN ... USING syntax may be the exception that proves the rule because the alternative to

from employees
join salaries [~id]

would be the strict mode equivalent of

from employees
join salaries [employees.id==salaries.id]

In the first case the user takes responsibility that id is present in both the employees and salaries tables.

aljazerzen commented 1 year ago

Another thing ... is how strict the name resolution should be.

I think this should be a separate issue.

I was asking @aljazerzen if the join transform could then detect that the filter argument is of the noresolve type ...

Unfortunately, no. For the same reasons that we cannot change behavior of join upon detecting if column has boolean or some other type.

If I understand correctly, this noresolve ~ operator would essentially act as s-string?


I'm familiar with ~ in R, but I never understood how does it work. My current understanding is that functions in R take arguments by expression and not by value (as they do in all other languages I know). This allows R to unpack the input expression and treat it as a formula.

For me, this is one of magic things that R does that I really don't want in PRQL. That's because in simple cases, magic helps a little and in complex cases there is no way to guess what the magic will do.


Invoking DoOcracy, I'll change "self equality operator" syntax to == and add some docs about it, so this doesn't block 0.3

But let's keep this issue open for a few weeks to get more feedback.

max-sixty commented 1 year ago

Agree re separate issue for strictness.

And great re DoOcracy (I hadn't heard of that but very much subscribe!)

And, not to agree with everything in that comment :) , but also re ~ — isn't this just an s-string? (Possibly that's worthy of a separate issue if we're discussing it more broadly than the join syntax). I do think there are legit discussions around how we resolve things, including strictness and ambiguity — it's worth reviewing https://prql-lang.org/book/internals/name-resolving.html for some context on the current impl.

snth commented 1 year ago

Unfortunately, no. For the same reasons that we cannot change behavior of join upon detecting if column has boolean or some other type.

Ok, I thought that one difference might be that the column type would depend on the database and would only be knowable at runtime whereas a distinction between ident and "unresolved ident" if specified in the PRQL code could be known at compile time and could therefore be used to branch the behaviour.

I was just operating on the reply to my comment on Discord that

This would work but we give up name resolution. Which means that you don't get errors at compile time

but I see from your description of my comment above that we had different understandings of what I meant.


If I understand correctly, this noresolve ~ operator would essentially act as s-string?

Probably, yes. I was proposing it as essentially syntactic sugar for the noresolve. prefix that was described in the Discord comment below:

The rule here is that if a param name is prefixed with noresolve. (which gets removed), param will not be resolved and idents will be translated as-is, into s-strings

However rereading that comment again, my understanding now is that that only referred to params and I was kinda taking it out of context and applying it more generally. So to clarify, noresolve. is a way in a function definition to specify that a specific parameter will not be resolved and turned into an s-string?


Thanks for the name-resolving PRQL Book link.

aljazerzen commented 1 year ago

So to clarify, noresolve. is a way in a function definition to specify that a specific parameter will not be resolved and turned into an s-string?

Yes, exactly.

For example join side:left has to be noresolve. so left is not interpreted as an ident referring to a column or something else.

But again, this is implementation detail and I don't want it to become a part of the language, because we may once declare left to be somekind of an enum in std and noresolve. will not be needed anymore.

max-sixty commented 1 year ago

I think we can close this? If there are any docs to write, feel free to assign them to me @aljazerzen

aljazerzen commented 1 year ago

I've already added a section in the PR: https://prql-lang.org/book/transforms/join.html

Take a look and expand if needed.

max-sixty commented 1 year ago

Ah yes, I already did a couple of tweaks in https://github.com/prql/prql/commit/39f2fb26c0e71274ad2f327d4d61492d86c03d86, I forgot