PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0

Aliases on relations #4751


aljazerzen commented 4 months ago

@snth recently expressed the opinion that relation aliases in PRQL are a mistake and should be removed.

By relation aliases we mean the name x in the following queries:

from x = employees

...

join x = managers

Currently, relation aliases have the effect of assigning a name to a relation variable within the query:

from x = employees
select {x.name}

@snth argues that relation aliases are a carryover from SQL and should be removed from PRQL because (a) they are not a fundamental part of relational algebra, and (b) they break the local nature of PRQL transforms and introduce global dependencies between different pipeline steps.

max-sixty commented 4 months ago

Would there be a different way of creating aliases (e.g. let)? How would self-joins work?

aljazerzen commented 4 months ago

I would tend to agree here, but for different reasons.

Global dependencies between transforms of the pipeline would make queries non-composable, but relation aliases don't actually behave as global "query variables". By this I mean that the following is not valid PRQL:

from x = employees
select {age}
select {age, x.name}  # error: unknown name x.name

Relation aliases are instead encoded into the resulting type of the pipeline. An alias has the same effect as wrapping each tuple of the relation in another tuple:

from employees    # type: [{id = int, name = text}]
select {x = this} # type: [{x = {id = int, name = text}}]
select {x.name}   # <-- this is a tuple field lookup
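
For illustration, here's a rough Python analogue of that wrapping, with plain dicts standing in for tuples (a sketch only, not PRQL semantics):

```python
rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]  # type: [{id, name}]

# wrap every tuple in another tuple named "x", as `select {x = this}` does
aliased = [{"x": row} for row in rows]                       # type: [{x = {id, name}}]

# "x.name" is now an ordinary nested-field lookup
print([row["x"]["name"] for row in aliased])                 # ['Ann', 'Bob']
```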

In this understanding of PRQL, it is hard to imagine how x = ... would convert the type [{...}] into [{x = {...}}]. What implies that x should sink into the array here? Why is it not converted into {x = [{...}]}?

So in this view, relation aliases are a weirdly specific behavior, which I would prefer PRQL not to have.

richb-hanover commented 4 months ago

@snth and @aljazerzen write:

> (b) they break the local nature of PRQL transforms and introduce global dependencies between different pipeline steps.

I saw manifestations of this recently: I was joining several tables, some of which had the same column names, and I had trouble knowing whether to use a relation alias, and/or which one to use in subsequent select statements. I ultimately just started "trying stuff" until it worked, but that's a sign of a less-than-perfect language design.

I also saw a comment about `this` and `that` above: they'll work more or less OK for two tables, but I can see it getting more complicated with three or more.

BUT... If relation aliases were to go away, how would this be expressed?

from a=apple
join b=banana (id)
join c=cantaloupe (id)
select {
  ...assortment of columns from each relation...
}
...

Thanks

aljazerzen commented 4 months ago

That's a good question and one that we have been circling around for some time now.

Here is what I propose: join constructs a tuple with two unnamed fields, containing the left and right tuples:

from apple               # type: [ {apple_id} ]
join banana (id)         # type: [ {{apple_id}, {banana_id}} ]
join c=cantaloupe (id)   # type: [ {{{apple_id}, {banana_id}}, {cantaloupe_id}} ]
select {
  this.0.0.apple_id,
  this.0.1.banana_id,
  this.1.cantaloupe_id,
}

Edit: I messed up the last type; cantaloupe_id was not enclosed in a tuple.
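
For illustration, a rough Python analogue of the proposed nesting, with plain tuples standing in for the unnamed fields (a sketch only):

```python
apple = {"apple_id": 1}
banana = {"banana_id": 2}
cantaloupe = {"cantaloupe_id": 3}

# each join nests the accumulated left-hand side one level deeper
row = ((apple, banana), cantaloupe)

print(row[0][0]["apple_id"])     # this.0.0.apple_id
print(row[0][1]["banana_id"])    # this.0.1.banana_id
print(row[1]["cantaloupe_id"])   # this.1.cantaloupe_id
```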

kgutwin commented 4 months ago

I'm not sure that more use of unnamed fields is going to be easy for users to follow - it puts more weight on the structure of the query rather than less, which feels very un-PRQL-like. Could the fields inherit the names from their source relations?

from apple               # type: [ {apple_id} ]
join banana (id)         # type: [ {apple_id, banana = {banana_id}} ]
join c=cantaloupe (id)   # type: [ {apple_id, banana = {banana_id}, c = {cantaloupe_id}} ]
select {
  apple_id,
  banana.banana_id,
  c.cantaloupe_id,
}

Note that I also flattened the types a bit and changed the bias of the "original" relation from the right to the left. With this, [ {apple_id} ] is kept at the same level through the whole join chain, and any number of additional relations are joined at a single level to the original. Would this also work?

max-sixty commented 4 months ago

> I'm not sure that more use of unnamed fields is going to be easy for users to follow - it puts more weight on the structure of the query rather than less, which feels very un-PRQL-like.

+1...


I can see the appeal of some ability to have a nested column structure, e.g. artists.tracks.length. But on its face, an inability to flatten tuples, requiring this.0.0.apple_id, seems really unfriendly. Applying theory should make things simpler; we should be skeptical of applying theory that doesn't...[^1]

[^1]: edit: of course, applying theory can often make some things simpler and some things more complicated, and the new, simpler way can replace the old, complicated ways such that the overall state becomes much better. So I'm not trying to decry any part of a proposal that makes one thing more complicated. But we do need some way to do reasonable joins...

richb-hanover commented 4 months ago

I was thinking along the same lines. I need to take some time to think through my ideas (and my naivety), and will respond by this weekend.

snth commented 4 months ago

Thank you for opening the issue @aljazerzen. My main bugbear is the "global dependencies" or, as you nicely put it, the breaking of composability of pipelines. That, together with the fact that I really want to implement backends other than SQL for PRQL: some of those have eager execution models (think Pandas or stream processing) where backreferences to previous pipeline steps won't work, since those require lazy evaluation (or fusion of steps).
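
To illustrate the eager-execution point, here's a contrived pandas sketch (a toy example, not an actual PRQL backend):

```python
import pandas as pd

employees = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 42]})

# an eager translation runs each step immediately:
step1 = employees          # from x = employees
step2 = step1[["age"]]     # select {age}  -- every other column is dropped here

# a later step referring back to x.name has nothing left to read:
# step2[["age", "name"]]   # KeyError: "name" no longer exists in step2
# a lazy backend could fuse the steps and keep "name" alive; an eager one cannot.
```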

My mental (or "logical") model of PRQL is very simple and is really based on stream processing: for each pipeline step or transform, it's "stream of tuples in, stream of tuples out". If we look at the signatures of the major transforms, we have (I started with [{...}] notation but then added regex multiplicities since they are more precise):

| transform | signature | description |
| --- | --- | --- |
| select | `{a} -> {b}` | one-to-one, i.e. it operates on just one tuple at a time |
| aggregate | `[{a}] -> {b}` | many-to-one |
| group | `[{a}] -> [{b}]+` | many-to-one-plus |
| filter | `{a} -> {b}?` | one-to-optional |
| join | `{a} -> [{b}]*` | one-to-many |

So join has the problem that it requires "flattening" or "flatmapping", as well as resolving potentially ambiguous names. group also has the potential for flattening, so it appears that resolving the name ambiguities is the bigger problem, particularly in the self-join case.
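
To make those signatures concrete, here's a minimal Python sketch (an illustration only, not compiler code) of select as one-to-one and join as one-to-many with inline flattening; the dict merge on the last line of join is exactly where ambiguous names collide:

```python
from typing import Callable, Iterable

Row = dict

def select(f: Callable[[Row], Row], rows: Iterable[Row]) -> Iterable[Row]:
    # {a} -> {b}: one tuple in, one tuple out
    for row in rows:
        yield f(row)

def join(matches: Callable[[Row], Iterable[Row]], rows: Iterable[Row]) -> Iterable[Row]:
    # {a} -> [{b}]*: one tuple in, zero or more tuples out, flattened inline
    for left in rows:
        for right in matches(left):
            yield {**left, **right}  # shared names silently clobber the left fields

employees = [{"id": 1, "name": "Ann"}]
managers = [{"id": 1, "name": "Zoe"}]
out = join(lambda l: (r for r in managers if r["id"] == l["id"]), employees)
print(list(out))  # [{'id': 1, 'name': 'Zoe'}] -- Ann's name is gone
```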

We could look at some prior art from other pipeline-based query builders as a start.

I'll address the previous two proposals in a separate comment.

snth commented 4 months ago

I often come back to M Lang (Power Query) because it's actually very similar to PRQL semantically (the main difference is that each pipeline step is named) and in many cases it generates SQL under the hood (you have to know where to look for it: it's under "Native Query"). I think it addresses both of the proposals by @aljazerzen and @kgutwin above.

Table.Join (Docs)

```m
let
    customers = Table.FromRecords({
        [TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567"],
        [TenantID = 1, CustomerID = 2, Name = "Jim", Phone = "987-6543"]
    }),
    orders = Table.FromRecords({
        [TenantID = 1, OrderID = 1, CustomerID = 1, Name = "Fishing rod", Price = 100.0],
        [TenantID = 1, OrderID = 2, CustomerID = 1, Name = "1 lb. worms", Price = 5.0],
        [TenantID = 1, OrderID = 3, CustomerID = 2, Name = "Fishing net", Price = 25.0]
    })
in
    Table.Join(
        customers,
        {"TenantID", "CustomerID"},
        Table.PrefixColumns(orders, "Order"),
        {"Order.TenantID", "Order.CustomerID"}
    )
```

**Output:**

```m
Table.FromRecords({
    [TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567",
     Order.TenantID = 1, Order.OrderID = 1, Order.CustomerID = 1, Order.Name = "Fishing rod", Order.Price = 100],
    [TenantID = 1, CustomerID = 1, Name = "Bob", Phone = "123-4567",
     Order.TenantID = 1, Order.OrderID = 2, Order.CustomerID = 1, Order.Name = "1 lb. worms", Order.Price = 5],
    [TenantID = 1, CustomerID = 2, Name = "Jim", Phone = "987-6543",
     Order.TenantID = 1, Order.OrderID = 3, Order.CustomerID = 2, Order.Name = "Fishing net", Order.Price = 25]
})
```

I think this is roughly equivalent to

from apple               # type: [ {apple_id} ]
join banana (id)         # type: [ {apple_id, banana = {banana_id}} ]
select {
  apple_id,
  banana.banana_id,
}

Notably, the result set has already been flattened, and the banana table's fields are accessible via banana.field_name (although they're actually all included as columns named banana.field_name rather than as nested records).

Table.NestedJoin (Docs)

```m
Table.NestedJoin(
    Table.FromRecords({
        [CustomerToCall = 1],
        [CustomerToCall = 3]
    }),
    {"CustomerToCall"},
    Table.FromRecords({
        [CustomerID = 1, Name = "Bob", Phone = "123-4567"],
        [CustomerID = 2, Name = "Jim", Phone = "987-6543"],
        [CustomerID = 3, Name = "Paul", Phone = "543-7890"],
        [CustomerID = 4, Name = "Ringo", Phone = "232-1550"]
    }),
    {"CustomerID"},
    "CustomerDetails"
)
```

**Output:**

```m
Table.FromRecords({
    [CustomerToCall = 1, CustomerDetails = Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})],
    [CustomerToCall = 3, CustomerDetails = Table.FromRecords({[CustomerID = 3, Name = "Paul", Phone = "543-7890"]})]
})
```

The big difference here is that the result set has the same number of rows as the LHS table, with the matching rows from the RHS grouped into a nested table. This is very nice from both a UX and a conceptual perspective (I find myself reaching for it quite often, actually), but it doesn't translate cleanly into SQL or other backends.

I think the equivalent here would be:

from apple               # type: [ { apple_id } ]
join banana (id)         # type: [ { apple_id, banana=[{banana_id}] } ]

It's not clear how we would do the subsequent flattening or select transforms afterwards.
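
One way to picture it (a hypothetical flatten step, sketched in Python; not a concrete proposal): flattening the nested field is itself an inner join, producing one output row per (left row, match) pair.

```python
# the NestedJoin-style result above, as plain Python data
nested = [
    {"apple_id": 1, "banana": [{"banana_id": 10}, {"banana_id": 11}]},
    {"apple_id": 2, "banana": []},  # no matches: dropped by an inner-style flatten
]

# hypothetical flatten: one output row per (left row, nested match)
flat = [
    {"apple_id": row["apple_id"], **match}
    for row in nested
    for match in row["banana"]
]
print(flat)  # [{'apple_id': 1, 'banana_id': 10}, {'apple_id': 1, 'banana_id': 11}]
```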

My conclusion

While I find the NestedJoin the nicest conceptually, it unfortunately doesn't map onto the SQL and other backends. I believe that the cleanest and safest logical foundation for PRQL is to treat it as a DSL for the List Monad. What this means is that, in its general form, each transform can have the signature {a} -> [{b}], i.e. each row from the input stream can potentially produce multiple rows in the output stream, and those multiple rows have to be flattened as part of the transform. (I'm not sure how group fits into this picture, as I haven't considered it yet. Comments welcome.)

Therefore, while it's a bit of a pain to deal with the flattening in the join step, I believe it's a necessity for logical consistency.
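
As a toy version of the List Monad reading (a sketch, not compiler code): with a single bind that maps each row through a {a} -> [{b}] function and flattens, filter-like and join-like steps become the same shape.

```python
def bind(rows, transform):
    # the list monad's bind: map each row to a list of rows, then flatten
    return [out for row in rows for out in transform(row)]

rows = [{"a": 1}, {"a": 2}, {"a": 3}]

# filter as {a} -> [{b}]? : zero or one output rows
keep_even = lambda r: [r] if r["a"] % 2 == 0 else []

# a join-like step as {a} -> [{b}]* : zero or more output rows
pairs = lambda r: [{**r, "k": k} for k in range(r["a"])]

print(bind(rows, keep_even))  # [{'a': 2}]
print(bind(rows, pairs))      # 1 + 2 + 3 = 6 rows, already flattened
```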

I'll try to illustrate some ideas for this by means of a Python example in the next comment.

snth commented 4 months ago

Here's a Python example I hacked together because I thought it might make things easier to demonstrate. The full code is at the end; I'll just focus on the examples for now:

X = [dict(a=1, b=10), dict(a=2, b=20), dict(a=3, b=30)]
Xs = from_(X) | []
print(f"{Xs=}")
# Xs=[{'a': 1, 'b': 10}, {'a': 2, 'b': 20}, {'a': 3, 'b': 30}]

The pipelines produce iterators, so the [] at the end just collects the results into a list.

My main idea was that the operand of join should be a parameterised query, which can then be evaluated in a particular context. This is illustrated in the example below, where b is left unbound, so in order to evaluate the query we have to provide a value for b as part of the context.

Y = from_(X) | select(c='a+b') | filter_('c<=b') | select(b='b', c='c')
Ys = list(Y.eval(context={'b': 25}))
print(f"{Ys=}")
# Ys=[{'b': 25, 'c': 11}, {'b': 25, 'c': 22}]

We can now use this Y to do joins. The idea is that as we iterate over the input stream, each input tuple is passed as the context to the query/relation/pipeline to be joined.

Z1 = from_(X) | join_parameterised(Y) | []
print(f"{Z1=}")
# Z1=[{'b': 20, 'c': 11}, {'b': 30, 'c': 11}, {'b': 30, 'c': 22}]

In order to do self-joins, we need to make the input tuples available inside the parameterised query. In my examples I used `_` as the name, but think of `this` or `that`, which could be used instead. With that, the following demonstrates a self-join:

X2 = from_(X) | filter_('a==_.a+1') | select(_a='_.a', _b='_.b', a='a', b='b')
Z2 = from_(X) | join_parameterised(X2) | []
print(f"{Z2=}")
# Z2=[{'_a': 1, '_b': 10, 'a': 2, 'b': 20}, {'_a': 2, '_b': 20, 'a': 3, 'b': 30}]

In general, it seems to me that the operand for a join always needs the following three parts:

1. the relation (or parameterised pipeline) to join against,
2. the join condition(s), and
3. the fields to produce in the output.

We can therefore alternatively pass these as explicit arguments to join, as demonstrated in the next example:

Fs = from_(X) | join_explicit(X, on=['a==_.a+1'], fields=dict(_a='_.a', _b='_.b', a='a', b='b')) | []
print(f"{Fs=}")
# Fs=[{'_a': 1, '_b': 10, 'a': 2, 'b': 20}, {'_a': 2, '_b': 20, 'a': 3, 'b': 30}]

My sense is that this would make the lines with joins too long, so it's probably a worse UX, and it would also be less general than the join_parameterised version.

As an aside: in this setup, repeated from transforms produce cross-products. Think of them as nested for loops, as in a list comprehension. I feel that would actually be the best way to specify cross-products, e.g.:

print('X x Ys:', from_(X) | from_(Ys) | [])
# X x Ys: [{'a': 1, 'b': 25, 'c': 11}, {'a': 1, 'b': 25, 'c': 22}, {'a': 2, 'b': 25, 'c': 11}, {'a': 2, 'b': 25, 'c': 22}, {'a': 3, 'b': 25, 'c': 11}, {'a': 3, 'b': 25, 'c': 22}]

EDIT: What I'm trying to express with this example is that I believe this model is rich enough to express the required logic. How to efficiently parse and extract the required parameters from it in the compiler, I don't know. I have more to say about some theoretical considerations, but that will have to wait for another time.

Appendix: Full code

```python
from collections import namedtuple


class Pipeline:
    def __init__(self):
        self.steps = []

    def __or__(self, step):
        if isinstance(step, list):
            # `pipeline | []` collects the results into the given list
            step.extend(self)
            return step
        self.steps.append(step)
        return self

    def __iter__(self):
        return self.eval()

    def eval(self, steps=None, context={}):
        if steps is None:
            steps = self.steps
        i = len(steps)
        if i > 1 and isinstance(steps[i - 2], let):
            context = context.copy()
            context.update(steps[i - 2]._bindings)
        if i > 0:
            transform = steps[i - 1]
            relation = self.eval(steps[:i - 1], context)
            for result in transform.eval(relation, context):
                yield result
        else:
            yield None


class Transform:
    def eval(self, relation, context={}):
        raise NotImplementedError()


class let(Transform):
    def __init__(self, **bindings):
        self._bindings = bindings

    def eval(self, relation, context={}):
        return relation


class from_(Transform):
    def __init__(self, relation: str | list):
        self._relation = relation._relation if isinstance(relation, Transform) else relation

    def __or__(self, transform):
        pipeline = Pipeline() | self
        return pipeline | transform

    def eval(self, relation, context={}):
        # repeated `from` steps behave like nested for loops (cross-product)
        for this in relation:
            if this is None:
                this = {}
            for that in self._relation:
                result = this.copy()
                result.update(that)
                yield result


class select(Transform):
    def __init__(self, **fields) -> None:
        self._fields = fields

    def eval(self, relation, context={}):
        for this in relation:
            result = {}
            for alias, expr in self._fields.items():
                result[alias] = eval(expr, context, this)
            yield result


class filter_(Transform):
    def __init__(self, *conditions) -> None:
        self._conditions = conditions

    def eval(self, relation, context={}):
        for this in relation:
            if all(eval(cond, context, this) for cond in self._conditions):
                yield this


class join_parameterised(Transform):
    def __init__(self, pipeline) -> None:
        self._pipeline = pipeline

    def eval(self, relation, context={}):
        for this in relation:
            # expose the current input tuple to the joined query as `_`
            context = dict(_=namedtuple('_', this)(**this), **this)
            for that in self._pipeline.eval(context=context):
                yield that


class join_explicit(Transform):
    def __init__(self, pipeline, on, fields, kind='inner') -> None:
        # NOTE: Can't do right joins
        self._pipeline = pipeline
        self._on = [on] if isinstance(on, str) else on
        self._fields = fields
        self._kind = kind

    def eval(self, relation, context={}):
        _select = select(**self._fields)
        _search = self._pipeline | filter_(*self._on) | _select
        for this in relation:
            context = dict(_=namedtuple('_', this)(**this), **this)
            j = 0
            for that in _search.eval(context=context):
                yield that
                j += 1
            if j == 0 and self._kind in {'full', 'left'}:
                _missing = self._fields.keys() - this.keys()
                # FIXME: Need schema info to fill the NULLs correctly for left joins.
                yield next(_select.eval([{fld: None for fld in _missing}], context=context))


if __name__ == "__main__":
    data = [dict(a=1, b=10), dict(a=2, b=20), dict(a=3, b=30)]
    X = from_(data)

    Xs = from_(X) | []
    print(f"{Xs=}")

    Y = from_(X) | select(c='a+b') | filter_('c<=b') | select(b='b', c='c')
    Ys = list(Y.eval(context={'b': 25}))
    print(f"{Ys=}")

    Z1 = from_(X) | join_parameterised(Y) | []
    print(f"{Z1=}")

    X2 = from_(X) | filter_('a==_.a+1') | select(_a='_.a', _b='_.b', a='a', b='b')
    Z2 = from_(X) | join_parameterised(X2) | []
    print(f"{Z2=}")

    Fs = from_(X) | join_explicit(X, on=['a==_.a+1'], fields=dict(_a='_.a', _b='_.b', a='a', b='b')) | []
    print(f"{Fs=}")

    # In this setup, repeated `from` transforms produce cross-products.
    # Think of them as nested for loops in list iterator context.
    print('X x Ys:', from_(X) | from_(Ys) | [])
```
grihabor commented 3 months ago

Kotlin has a method called apply that lets you access an object's attributes in the current scope as if they were regular variables:

val adam = Person("Adam").apply {
    age = 32
    city = "London"        
}
println(adam)

Similarly, Nix has the with expression:

let
  a = {
    x = 1;
    y = 2;
    z = 3;
  };
in
with a; [ x y z ]

In PRQL this might look like:

from apple               # type: [ {apple = {apple_id, color}} ]
join banana (id)         # type: [ {apple = {apple_id, color}, banana = {banana_id, color}} ]
select (with banana) {
  apple.apple_id,
  banana_id,
  color,
}

So aliases look more logical now:

from apple                       # type: [ {apple = {apple_id, color}} ]
join banana = apple (id)         # type: [ {apple = {apple_id, color}, banana = {apple_id, color}} ]
select (with banana) {
  apple.apple_id,
  banana_id,
  color,
}

wdyt?

max-sixty commented 3 months ago

(I quite like that principle @grihabor! I think that's what we're going for throughout, need to think about the exact semantics / when they get triggered...)

richb-hanover commented 3 months ago

[Please forgive my verbosity - it helps me understand issues to write things out explicitly so I can follow this conversation.]

My (loosey-goosey) understanding of join is that SQL places the columns of two tables side by side, based on the join criteria. The result is a table that doesn't really have any structure, just a bunch of columns. [*]

Here's my understanding of the notation we're using. This query:

from apple
join banana (==id)
join cantaloupe (==id)

would result in the following successive outcomes [*], where all three tables have a column named id, (not apple_id, banana_id, cantaloupe_id):

type: [ { apple = {id, color} } ]
type: [ { apple = {id, color}, banana = {id, color} } ]
type: [ { apple = {id, color}, banana = {id, color}, cantaloupe = {id, color} } ]

To carry my naive understanding further, it's fair to think of the result as having six columns, { id, color, id, color, id, color }, where the first pair is somehow tagged with the fact that it came from the apple table, the second pair from the banana table, and the third from the cantaloupe table. [*]

If each column had a unique name (say, a_color, b_color, c_color), the underlying SQL engine would be able to choose the proper one unambiguously. [*]

The problem comes when identical column names appear in multiple tables. [*]
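
For a quick illustration of that problem (hypothetical Python, merging a joined row into one flat namespace): identical names silently overwrite each other.

```python
apple = {"id": 1, "color": "red"}
banana = {"id": 1, "color": "yellow"}

merged = {**apple, **banana}
print(merged)  # {'id': 1, 'color': 'yellow'} -- apple's color is silently lost
```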

So...

  1. Are all my assumptions (marked with [*]) correct? If not, how should I adjust my understanding?
  2. I think this "relation alias" discussion is about how a PRQL query can refer to the columns of the result table by the "place they came from".

Do I have this right? Many thanks.