Open fhyy opened 1 month ago
To address the issue with Apache Drill and GROUP BY
statements in Apache Superset, it's essential to understand that Drill might not support using aliases in GROUP BY
that are defined in the SELECT
part of the query for aggregation purposes directly. Instead, you should directly use the aggregation expression in the GROUP BY
clause or reference the original column being aggregated.
For instance, if your query is:
SELECT length(n_name) AS n_name
FROM
(select * from cp.`tpch/nation.parquet`)
GROUP BY length(n_name)
LIMIT 10;
Modify it to avoid using the alias in the GROUP BY
clause:
SELECT length(n_name) AS n_name
FROM
(select * from cp.`tpch/nation.parquet`)
GROUP BY n_name
LIMIT 10;
This adjustment ensures compatibility with Apache Drill's SQL parsing and execution behavior, working around limitations related to the use of aliases in GROUP BY
clauses for aggregated expressions.
Regarding appending "__" to aliases, this solution was specific to ORDER BY
statements to ensure compatibility with the database engine's requirements. While a similar approach could theoretically apply to GROUP BY
statements, it's crucial to test any such changes thoroughly to ensure they do not introduce compatibility issues with other supported database engines in Superset.
If you encounter further issues or have additional questions, feel free to ask for more assistance.
To continue the conversation, mention @dosu.
I have implemented a fix based on the suggestion. This can be found in PR #29455 .
I'm currently maintaining an image with this fix for my company, keeping it up to date with your changes. I would greatly appreciate it if this issue can be resolved so that I can use the official versions.
Bug description
GROUP BY
statements usually do not support the use of aliases, but it is supported in Drill since 2019. This added support results in the alias being picked inGROUP BY
statements instead of the source column if they have the same name. This is problematic if the alias refers to an aggregation of the source column.These types of statements are generated when creating a chart and aggregating the values of an axis:
Testing this statement in the SQL Lab against a Drill connection results in the following error:
and an exception in the logs:
The problem is the statement
GROUP BY length(n_name)
. The aggregationlength(n_name)
refers to the source columnn_name
, but the Drill engine uses the aliasn_name
instead. The resulting executed statement is thusGROUP BY length(length(n_name))
, which returns no data.Suggestion
There was a similar issue with
ORDER BY
statements which was fixed by appending "__" to the name of such aliases, ensuring that the source column is used. This was done in PR 13739 (https://github.com/apache/superset/pull/13739)The same fix can also be applied for
GROUP BY
statements, and the problematic statement above would instead look like this:How to reproduce the bug
Screenshots/recordings
No response
Superset version
4.0.0
Python version
Not applicable
Node version
Not applicable
Browser
Firefox
Additional context
From the logs:
Checklist