apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.89k stars 1.11k forks source link

Add min_by and max_by aggregate functions #12075

Open timsaucer opened 3 weeks ago

timsaucer commented 3 weeks ago

Is your feature request related to a problem or challenge?

It is a common need to get the value of one column such that another column is a minimum. For example, if I have a column of fruit_name and price_per_pound I might want to get the fruit_name for which price_per_pound is a minimum. This can be done with existing functions, but it should be both more performant and more user friendly to add these functions.

Describe the solution you'd like

Ideally this would take two expressions. The first would be the expression you want to return and the second would be the expression that we are looking for the min/max of. In my example we would do something like

min_by(col("fruit_name"), col("price_per_pound"))

Describe alternatives you've considered

Right now I would probably do a first_value with an order_by. This will introduce an unnecessary sort of the dataframe.

Additional context

Example from

https://docs.databricks.com/en/sql/language-manual/functions/min_by.html

alamb commented 3 weeks ago

We have similar special functions for this type of function in InfluxDB (needed to get the max by time, for example) in case that helps this

We call them selector functions. Here are the docs https://docs.influxdata.com/influxdb/cloud-serverless/reference/sql/functions/selector/

Here is the code https://github.com/influxdata/influxdb3_core/blob/main/query_functions/src/selectors/internal.rs

korowa commented 2 weeks ago

There are already aggregation variants of first/last which seem to solve the issue (example), and, at first glance, they do not perform normal sorting, only compare incoming ordering column values with "accumulated" ones.

It also seems that their performance could be improved by implementing GroupsAccumulator for them (or for majority of input types at least), as their state is somehow similar to AVGs state in terms of comlexity.

alamb commented 2 weeks ago

FIRST_VALUE / LAST_VALUE with ORDER BY is a great call. Here is the code if anyone is interested:

https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs

korowa commented 2 weeks ago

More like https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/first_last.rs (I suppose)

korowa commented 2 weeks ago

My bad, it still sorts input data (during e.g. get_first_idx, which is likely a redundant sort, except for the cases when input is already sorted by ordering columns). That is another thing to improve.

timsaucer commented 2 weeks ago

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

alamb commented 2 weeks ago

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

FWIW this is what the first_selector / last_selector functions do in InfluxDB -- they basically maintain only the value for the largest ORDER BY column, rather than actually sorting the entire intermediate result

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

It would be nice to make it clearer how to get the equivalent of min_by and max_by -- maybe we could just document the equivalent first_value(... ORDER BY ...) queries 🤔 or we could automatically rewrite min_by / max_by into first_value() / last_value 🤔

It is probably also worth filing a ticket for the potential optimization

korowa commented 2 weeks ago

I guess user guide will be the best option since naming for this function may vary in different DBs -- min_by / argmin / argmin_agg, and maybe couple of others, so aliasing is not that straightforward.