MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.71k stars 465 forks source link

Table functions + window functions + aggregates in the same `SELECT` clause #20979

Open ggevay opened 1 year ago

ggevay commented 1 year ago

What version of Materialize are you using?

26b5206857bdd1186a066759fbce863ca6d30985

What is the issue?

When a query has table functions, window functions, and aggregates, then Postgres evaluates them in the following order:

  1. aggregates
  2. window functions
  3. table functions
CREATE TABLE bar (a int);
INSERT INTO bar VALUES (0), (1), (2), (3);

We have two problems:

We bail out when a query has both a table function and an aggregate

Materialize:

materialize=> SELECT sum(a), generate_series(1, sum(a))
FROM bar
GROUP BY a%2;
ERROR:  aggregate functions are not allowed in table function arguments (function pg_catalog.sum)
materialize=> SELECT sum(a), generate_series(1, a%2)
FROM bar
GROUP BY a%2;
ERROR:  column "table_func_e9f06e7a-b84c-43f4-8a9f-69eda6bdf999" does not exist

Postgres:

gabor=# SELECT sum(a), generate_series(1, sum(a))
FROM bar
GROUP BY a%2;
 sum | generate_series 
-----+-----------------
   2 |               1
   2 |               2
   4 |               1
   4 |               2
   4 |               3
   4 |               4
(6 rows)

gabor=# SELECT sum(a), generate_series(1, a%2)
FROM bar
GROUP BY a%2;
 sum | generate_series 
-----+-----------------
   4 |               1
(1 row)

When a query has just a table function and a window function, then we evaluate these in the opposite order from Postgres

Materialize:

materialize=> SELECT a, generate_series(1, a), lag(a) OVER (ORDER BY a)
FROM bar;
 a | generate_series | lag 
---+-----------------+-----
 1 |               1 |    
 2 |               1 |   1
 2 |               2 |   2
 3 |               1 |   2
 3 |               2 |   3
 3 |               3 |   3
(6 rows)

materialize=> SELECT a, lag(a) OVER (ORDER BY a), generate_series(1, a)
FROM bar;
 a | lag | generate_series 
---+-----+-----------------
 1 |     |               1
 2 |   1 |               1
 2 |   2 |               2
 3 |   2 |               1
 3 |   3 |               2
 3 |   3 |               3
(6 rows)

Postgres:

gabor=# SELECT a, generate_series(1, a), lag(a) OVER (ORDER BY a)
FROM bar;
 a | generate_series | lag 
---+-----------------+-----
 1 |               1 |   0
 2 |               1 |   1
 2 |               2 |   1
 3 |               1 |   2
 3 |               2 |   2
 3 |               3 |   2
(6 rows)

gabor=# SELECT a, lag(a) OVER (ORDER BY a), generate_series(1, a)
FROM bar;
 a | lag | generate_series 
---+-----+-----------------
 1 |   0 |               1
 2 |   1 |               1
 2 |   1 |               2
 3 |   2 |               1
 3 |   2 |               2
 3 |   2 |               3
(6 rows)

Slack discussion: https://materializeinc.slack.com/archives/C02FWJ94HME/p1691091394842719

ggevay commented 1 year ago

Note that this issue will probably be easier to tackle after https://github.com/MaterializeInc/materialize/issues/20746.

ggevay commented 1 year ago

Btw. one can also ask what should happen when a window function is inside a table function or vice versa.

maddyblue commented 1 year ago

This seems related to https://github.com/MaterializeInc/materialize/issues/20533

ggevay commented 9 months ago

Related to https://github.com/MaterializeInc/materialize/pull/22964/files, i.e., that window functions are not supported in table function arguments.

chuck-alt-delete commented 1 month ago

Simplified example of the error a customer ran into:

select distinct unnest(array_agg(x)) from (values (1), (1), (2)) x;
Error: aggregate functions are not allowed in table function arguments (function pg_catalog.array_agg)

This appears to be a somewhat common technique to deduplicate an array in Postgres.

ggevay commented 1 month ago

So, this query starts from individual values as an input, makes an array, blows up the array, and then does a distinct. It seems to me that the unnest(array_agg(x)) is a no-op here.

Does the user start from arrays? If yes, then maybe the nesting is in the other direction, i.e., array_agg(unnest(...))? (But unfortunately, we don't support this nesting direction either at the moment, I think.)

maddyblue commented 1 month ago

It's a no-op here yes. I'm gonna guess that that example has some actual logic that got simplified away for a minimum repro.

chuck-alt-delete commented 1 month ago

There are some nuances lost from the customer’s use case for sure. Their example uses the context from the outer query akin to a lateral join. I think when we see this pattern it almost always means there is a more efficient rewrite, probably involving computing aggregates in a separate view/CTE and then joining back in.

We ended up finding a better rewrite for this particular application that involved using a MAX to determine what label should be applied to a group of records (see slack)

I’d say this customer is unblocked on this particular usage, and I’m not sure if there is any optimization lesson to be learned here. It’s probably good for now that this returns an error because it seems likely we should always be doing something other than unnesting an array_agg.

ggevay commented 1 week ago

The first problem from the issue description ("a query has both a table function and an aggregate") occurred at a user today: https://materializeinc.slack.com/archives/C04SN4BEQ0N/p1724053612520059?thread_ts=1723657127.783599&cid=C04SN4BEQ0N

We could at least make the error msg nicer.