Error creating time Graphs using Impala data source

paulobitar commented 8 months ago

Bug description

When I try to create a time Graphs, I get an erro message "AnalysisException: select list expression not produced by aggregation output (missing from GROUP BY clause?): trunc(field_name, 'MONTH'). Iḿ connecting to Impala. I Got the SQL generated and the problem is that Superset doesn´t include the field name.

This is the Query:

When I tried to run this query on impala, I saw that the correct would be:

How to reproduce the bug

Create a time graph (any kind, it just can´t be the legacy one) Choose a Impala dataset in the x-axis, select a datetime field Select a metric. Update the chart.

inspect the query.

Go to sql labs and change the query, including the name of the field used in trunc.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

[X] I have searched Superset docs and Slack and didn't find a solution to my problem.
[X] I have searched the GitHub issue tracker and didn't find a similar bug report.
[ ] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

rusackas commented 3 months ago

Anyone still facing this? I don't know of anyone using Impala, so this is hard to troubleshoot, and is at risk of being closed as stale. Perhaps before that happens, @dosu-bot can help a bit.

dosubot[bot] commented 3 months ago

Hey, @rusackas! 👋 I'm here to help you out while we wait for a human maintainer. I'm a bot that can assist you with bugs, questions, and becoming a contributor.

To resolve the issue where the generated SQL query for an Impala data source does not include the field name in the GROUP BY clause when truncating dates, you need to ensure that the field name is included in the GROUP BY clause. This can be done by modifying the query generation logic to include the field name explicitly.

Here is the relevant part of the code that constructs the GROUP BY clause:

if groupby_all_columns:
    qry = qry.group_by(*groupby_all_columns.values())

To ensure that the field name is included, you need to make sure that groupby_all_columns is populated correctly. This is done in the following section:

if need_groupby:
    # dedup columns while preserving order
    columns = groupby or columns
    for selected in columns:
        if isinstance(selected, str):
            # if groupby field/expr equals granularity field/expr
            if selected == granularity:
                table_col = columns_by_name[selected]
                outer = table_col.get_timestamp_expression(
                    time_grain=time_grain,
                    label=selected,
                    template_processor=template_processor,
                )
            # if groupby field equals a selected column
            elif selected in columns_by_name:
                outer = self.convert_tbl_column_to_sqla_col(
                    columns_by_name[selected],
                    template_processor=template_processor,
                )
            else:
                selected = validate_adhoc_subquery(
                    selected,
                    self.database_id,
                    self.schema,
                )
                outer = literal_column(f"({selected})")
                outer = self.make_sqla_column_compatible(outer, selected)
        else:
            outer = self.adhoc_column_to_sqla(
                col=selected, template_processor=template_processor
            )
        groupby_all_columns[outer.name] = outer
        if (
            is_timeseries and not series_column_labels
        ) or outer.name in series_column_labels:
            groupby_series_columns[outer.name] = outer
        select_exprs.append(outer)

Ensure that the outer variable, which represents the column to be included in the GROUP BY clause, is correctly assigned and added to the groupby_all_columns dictionary. This will ensure that the field name is included in the GROUP BY clause, thus resolving the AnalysisException ^[1].

Additionally, you can customize the SQL query generation using the SQL_QUERY_MUTATOR configuration option in the superset/config.py file. This function allows you to modify the SQL query string before it is executed:

def SQL_QUERY_MUTATOR(sql: str, **kwargs: Any) -> str:
    return sql

By implementing this function, you can ensure that the GROUP BY clause includes the necessary field names based on your specific requirements ^[2].

To continue talking to Dosu, mention @dosu.

apache / superset