apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.39k stars 1.21k forks source link

Improve error message for invalid aggregate queries #12006

Open 2010YOUY01 opened 3 months ago

2010YOUY01 commented 3 months ago

Is your feature request related to a problem or challenge?

This query is invalid because its order by clause has an aggregated column

select min(v2) from t1 group by v1 order by v2;

However its error message is confusing (run in datafusion-cli:)

DataFusion CLI v41.0.0
> create table t1(v1 int, v2 int);
0 row(s) fetched.
Elapsed 0.073 seconds.

> select min(v2) from t1 group by v1 order by v2;
Schema error: No field named t1.v2. Valid fields are "min(t1.v2)".

Reference for duckdb's error message which is more understandable

D select min(v2) from t1 group by v1 order by v2;
Binder Error: column "v2" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(v2)" if the exact value of "v2" is not important.
LINE 1: ...t min(v2) from t1 group by v1 order by v2;

Describe the solution you'd like

Use error message similar to duckdb's

Note use different aggregate function (as suggested by DuckDB's error message to use ANY_VALUE) is not supported for DataFusion (yet), we should change this part of the error message

-- Allowed
select min(v2) from t1 group by v1 order by v1;
-- Allowed
select min(v2) from t1 group by v1 order by min(v2);
-- Not supported yet by DataFusion, but widely supported by other systems
select min(v2) from t1 group by v1 order by first_value(v2);

Describe alternatives you've considered

No response

Additional context

No response

jonathanc-n commented 1 month ago

take

jonathanc-n commented 1 month ago

@2010YOUY01 Might need some help with this one. From my understanding the function is right now being called when the type coercion rule is being called during the optimize phase. Do we want to set up a check during the sql phase to make sure that the order by expression appears either as an aggregate in the select or part of the group by clause?

When I think of implementing that, it involves going to the select_to_plan function and doing the check. Would this be fine? it seems a little costly though

2010YOUY01 commented 1 month ago

@2010YOUY01 Might need some help with this one. From my understanding the function is right now being called when the type coercion rule is being called during the optimize phase. Do we want to set up a check during the sql phase to make sure that the order by expression appears either as an aggregate in the select or part of the group by clause?

When I think of implementing that, it involves going to the select_to_plan function and doing the check. Would this be fine? it seems a little costly though

Thank you for the recent bug fixes. I'm not familiar with this part of code either. My understanding of your concern is: this piece of check is during optimize phase now, but it's more reasonable to be done in sql(planning) phase? If that's the case, the current implementation may be due to some historical reasons. I think it's better to keep all related logic in one place, so it's okay to do that during optimization, and also good to refactor everything to a more reasonable place (if this is easy to do)

jonathanc-n commented 1 month ago

Figuring out the optimization part is tricky, as the error is being called during the optimization traversal. I'll unassign myself as I want to currently learn another part of the codebase first.