apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.23k stars 13.65k forks source link

Problem performing calculations with Duplicated dimensions #30507

Open xgraulap opened 1 week ago

xgraulap commented 1 week ago

Bug description

Good afternoon, I have a problem when using Superset to perform calculations. I have my data in a unique table, with Procedure ID as Surrogate Key and some dimensions adding information of the procedures. Also I have numerator and denominator calculated for every Procedure ID to get the result of the KPI using the following metric:

sum(numerator)/sum(denominator).

The thing is that if I do the calculation like this, the calculation is done without taking into account duiplications in the dimensions, so the results become incorrect.

The issue is that Superset doesen't allow me to add SUBSELECT to the metrics, so I can't perform my calculations over the total number of different Procedure ID that accomplish my filters.

I proposed a metric like this:

SUM(numerator)/SUM(denominator) FROM ( SELECT DISTINCT procedure, numerator, denominator FROM indicatorfourancillary ) AS unique_procedures

But it doesn't meet my expectations.

So I would like to know if in Superset, there's a way of doing first a SELECT of the unique procedure ID that meet my filters, and then perform the calculations over them.

Thank you so much in advance, @villebro, @rusakas y @betodealmeida

Screenshots/recordings

Duplications

This is my actual state, this LOT should give me 0/1 and now I'm getting 0/4.

Superset version

master / latest-dev

Python version

I don't know

Node version

16

Browser

Chrome

Additional context

No response

Checklist

dosubot[bot] commented 1 week ago

Hello, @xgraulap! I'm here to help you with any bugs, questions, or contributions you have for Superset. Let's tackle this issue together!

To address the issue of performing calculations with duplicate dimensions in Superset, you can consider the following approach:

  1. Enable Adhoc Subqueries: If you want to use subqueries in your metrics, you need to enable the 'ALLOW_ADHOC_SUBQUERY' feature in Superset. This can be done by adding the following configuration to your superset_config.py file:

    FEATURE_FLAGS = {
       "ALLOW_ADHOC_SUBQUERY": True
    }

    This will allow you to use subqueries in your custom SQL fields, which can help in handling duplicate dimensions by allowing more complex SQL operations [1].

  2. Create a View in the Database: Another approach is to create a view in your database that pre-aggregates or filters the data to remove duplicates. You can then use this view as a data source in Superset. This way, you can ensure that the calculations are performed on the correct set of unique data.

  3. Use SQL Lab for Custom Queries: You can use SQL Lab in Superset to write custom SQL queries that handle duplicates as needed. Once you have the desired result, you can save the query as a virtual dataset and use it for your visualizations.

These methods should help you perform calculations over the total number of distinct Procedure IDs that meet your filters, ensuring accurate KPI results.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

@dosu could I add this command

FEATURE_FLAGS = { "ALLOW_ADHOC_SUBQUERY": True }

directly to the dashboard Advanced config?

dosubot[bot] commented 1 week ago

No, the ALLOW_ADHOC_SUBQUERY feature flag cannot be set directly through the dashboard Advanced config. It needs to be added to the superset_config.py file to take effect [1].

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

@dosu and after adding this to my Superset config, how can I do subqueries on the metrics without getting an error when creating the charts?

dosubot[bot] commented 1 week ago

After enabling the ALLOW_ADHOC_SUBQUERY feature in your superset_config.py, you can use subqueries in your metrics. However, you need to ensure that your SQL syntax is correct and that the subqueries are properly structured. Here are some steps to help you avoid errors:

  1. Write the Subquery in SQL Lab: Start by writing and testing your subquery in SQL Lab to ensure it returns the expected results. This helps in debugging any SQL syntax issues before using it in a metric.

  2. Use the Subquery in a Virtual Dataset: Once your subquery works in SQL Lab, you can save it as a virtual dataset. This allows you to use the results of the subquery as a data source for your charts.

  3. Reference the Virtual Dataset in Metrics: When creating metrics, reference the virtual dataset instead of writing the subquery directly in the metric. This approach simplifies the metric definition and reduces the chance of errors.

By following these steps, you should be able to use subqueries in your metrics without encountering errors when creating charts.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

The chart is shown like this

SELECT CAST(SUM(CAST(numerator AS DECIMAL(10, 3))) / SUM(denominator) AS DECIMAL(10, 3)) FROM (SELECT DISTINCT procedure) AS "Is Joint Procurement" FROM indicatorfourancillary LIMIT 50000;

when I put this metric CAST( SUM(CAST(numerator AS DECIMAL(10,3))) / SUM(denominator) AS DECIMAL(10,3)) FROM (SELECT DISTINCT procedure)

but it's not correct @dosu

dosubot[bot] commented 1 week ago

Could you provide more details about the error or issue you're encountering with this query? Are there any specific error messages or unexpected results? Additionally, could you clarify what you mean by "it's not correct"? This information will help in diagnosing the problem more effectively.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

If I use the metric wrote before, I get this error

Unexpected error Error: syntax error at or near "FROM" LINE 4: FROM indicatorfourancillary ^

So I want to know how to add the subquery @dosu

dosubot[bot] commented 1 week ago

The error you're encountering is due to incorrect SQL syntax. In Superset, when defining a metric, you cannot directly include a FROM clause within the metric definition itself. Metrics should be simple expressions that operate on the columns of the dataset.

To achieve your goal, you should first create a virtual dataset using SQL Lab with the subquery you want. Here's how you can do it:

  1. Create a Virtual Dataset:

    • Go to SQL Lab and write your subquery to select distinct procedures and calculate the numerator and denominator.
    • For example:
      SELECT DISTINCT procedure, numerator, denominator
      FROM indicatorfourancillary
    • Run the query to ensure it returns the expected results.
    • Save this query as a virtual dataset.
  2. Use the Virtual Dataset in a Chart:

    • Create a new chart and select the virtual dataset you just created as the data source.
    • Define your metric as SUM(numerator) / SUM(denominator) without any subquery, since the virtual dataset already handles the distinct selection.

By following these steps, you can effectively use subqueries in your analysis without encountering syntax errors in the metric definition.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

If I do this, I am not able to apply filters with my dimensions. This is not a good approach for my case @dosu I need to work with my actual dataset but using only DISTINCT procedures

xgraulap commented 1 week ago

If I do this, I am not able to apply filters with my dimensions. This is not a good approach for my case @dosu I need to work with my actual dataset but using only DISTINCT procedures

dosubot[bot] commented 1 week ago

To work with your actual dataset while using only distinct procedures and still be able to apply filters on your dimensions, you can try the following approach:

  1. Create a Derived Table in SQL Lab: Use SQL Lab to create a derived table that selects distinct procedures along with the necessary dimensions and metrics. For example:

    SELECT DISTINCT procedure, dimension1, dimension2, numerator, denominator
    FROM indicatorfourancillary

    Ensure that this query includes all the dimensions you need for filtering.

  2. Save as a Virtual Dataset: Save this query as a virtual dataset. This dataset will now have distinct procedures and include the dimensions you want to filter on.

  3. Use the Virtual Dataset in a Chart: When creating a chart, use this virtual dataset as your data source. You can now apply filters on the dimensions as needed and define your metric as SUM(numerator) / SUM(denominator).

This approach allows you to maintain the distinct procedures while still being able to filter on your dimensions.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

This is what I want to evitate in fact @dosu

I put this command in my config

FEATURE_FLAGS = { "ALLOW_ADHOC_SUBQUERY": True }

in order to be able to do SUBSELECT in my metrics. Could you help me on how to add this unicity directly in my metric?

dosubot[bot] commented 1 week ago

In Superset, even with the ALLOW_ADHOC_SUBQUERY feature enabled, you cannot directly include a FROM clause or subquery within a metric definition. Metrics are designed to be simple expressions that operate on the columns of the dataset.

To achieve the desired unicity directly in your metric, you can use a workaround by leveraging SQL Lab to create a virtual dataset that incorporates the distinct logic. Here's a refined approach:

  1. Create a Virtual Dataset with Distinct Logic:

    • In SQL Lab, write a query that selects distinct procedures along with the necessary dimensions and metrics. For example:
      SELECT DISTINCT procedure, dimension1, dimension2, numerator, denominator
      FROM indicatorfourancillary
    • Ensure this query includes all dimensions you need for filtering.
  2. Save as a Virtual Dataset:

    • Save this query as a virtual dataset. This dataset will now have distinct procedures and include the dimensions you want to filter on.
  3. Use the Virtual Dataset in a Chart:

    • Use this virtual dataset as your data source when creating a chart.
    • Define your metric as SUM(numerator) / SUM(denominator) without any subquery, since the virtual dataset already handles the distinct selection.

This method allows you to maintain distinct procedures while still being able to filter on your dimensions, effectively achieving the unicity you need in your metrics.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

@dosu can I use the GROUP BY in a BIG NUMBER chart?

dosubot[bot] commented 1 week ago

Yes, you can use the GROUP BY clause in a Big Number chart in Superset. However, keep in mind that Big Number charts are typically used to display a single aggregated value. If you use GROUP BY, it will generate multiple rows, which might not be suitable for a Big Number chart.

If you need to aggregate data with GROUP BY and still want to display a single value, consider using a subquery or a virtual dataset to pre-aggregate the data. Then, select the specific aggregated value you want to display in the Big Number chart.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

@dosu how can I use this GROUP BY?

dosubot[bot] commented 1 week ago

To use GROUP BY in a Big Number chart, you need to pre-aggregate your data before using it in the chart. Here's how you can do it:

  1. Use SQL Lab:

    • Write a query in SQL Lab that includes the GROUP BY clause to aggregate your data as needed.
    • For example:
      SELECT procedure, SUM(numerator) AS total_numerator, SUM(denominator) AS total_denominator
      FROM indicatorfourancillary
      GROUP BY procedure
    • Run the query to ensure it returns the expected results.
  2. Save as a Virtual Dataset:

    • Save this query as a virtual dataset. This dataset will now have the aggregated data you need.
  3. Create a Big Number Chart:

    • Use this virtual dataset as the data source for your Big Number chart.
    • Select the specific aggregated value you want to display, such as SUM(total_numerator) / SUM(total_denominator).

By pre-aggregating your data with GROUP BY in SQL Lab and using a virtual dataset, you can effectively use the results in a Big Number chart.

To continue talking to Dosu, mention @dosu.

xgraulap commented 1 week ago

I'm not getting a response fitting my need. Could you help @betodealmeida @rusackas @villebro ?