apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

Support multiple order aware aggregate functions in a query #8582

Open alamb opened 6 months ago

alamb commented 6 months ago

Is your feature request related to a problem or challenge?

Today DataFusion supports three aggregate functions that can be "order aware": ARRAY_AGG, FIRST_VALUE and LAST_VALUE. This means that you can supply a ORDER BY clause to their argument, for example FIRST_VALUE(x ORDER BY time).

Today, there be only one single order specified across ALL order aware aggregate functions

For example

❯ create table t(x int, y int) as values (1, 1), (1, 2), (1, 1), (1, 4), (2, 20), (2, 10);;
0 rows in set. Query took 0.003 seconds.

❯ select x, first_value(x ORDER BY y) from t GROUP BY x;
+---+------------------+
| x | FIRST_VALUE(t.x) |
+---+------------------+
| 2 | 2                |
| 1 | 1                |
+---+------------------+
2 rows in set. Query took 0.004 seconds.

❯ select x, first_value(x ORDER BY y), first_value(x ORDER BY y DESC) from t GROUP BY x;
+---+------------------+-----------------+
| x | FIRST_VALUE(t.x) | LAST_VALUE(t.x) |
+---+------------------+-----------------+
| 1 | 1                | 1               |
| 2 | 2                | 2               |
+---+------------------+-----------------+
2 rows in set. Query took 0.004 seconds.

❯ select x, first_value(x ORDER BY y), first_value(x ORDER BY y DESC NULLS LAST) from t GROUP BY x;
This feature is not implemented: Conflicting ordering requirements in aggregate functions is not supported

Describe the solution you'd like

There are a few designs proposed here: https://github.com/apache/arrow-datafusion/pull/8558#issuecomment-1862649886

We are working on a more detailed proposal

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 6 months ago

Here is a design document @mustafasrepo and I are working on: https://docs.google.com/document/d/1cIIJL6RKXKge-t8z4Rs_F-XH82giWQfmciwj-h3tho4/edit

alamb commented 6 months ago

@mustafasrepo and @ozankabak -- I went over and added a third design option (to move the order awareness into the aggregators), which I would like to consider as well. I think it is likely to perform signfiicantly faster for many queries as well as keep the HashAggregateExec simpler (though it makes the aggregators themselves potentially more complicated)

Can you review https://docs.google.com/document/d/1cIIJL6RKXKge-t8z4Rs_F-XH82giWQfmciwj-h3tho4 and let me know what you think?

alamb commented 5 months ago

FYI @mustafasrepo and @ozankabak -- I filed https://github.com/apache/arrow-datafusion/issues/8777, about reusing CTEs, as it came up in other context, and also could potentially be used for solution 2 in the the design document https://docs.google.com/document/d/1cIIJL6RKXKge-t8z4Rs_F-XH82giWQfmciwj-h3tho4/edit