apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.38k stars 13.7k forks source link

[pivot table] Query with ROW LIMIT #13562

Closed graceguo-supercat closed 2 years ago

graceguo-supercat commented 3 years ago

How to reproduce the bug

  1. Create a PivotTable chart,
  2. Add ROW LIMIT = 200
  3. Check generated query by click VIEW QUERY

Actual results

SELECT "dim_destination_geo" AS "dim_destination_geo",
       date_trunc('week', CAST("night_in_latest_year" AS TIMESTAMP)) AS "night_in_latest_year",
       ...
FROM "tmp"."forward_occupancy_forecast_superset_table"
WHERE "night_in_latest_year" >= '2021-03-11 00:00:00.000000'
  AND "night_in_latest_year" < '2021-06-15 00:00:00.000000'
ORDER BY "Nights Backlog" DESC
LIMIT 200;

Instead of filtering out the top K rows in the superset view, it filters out the top K rows in the input data; this results in the pivot table having lots of missing cells.

Expected results

Superset should generate a sub-query with limit like this:

SELECT "dim_destination_geo" AS "dim_destination_geo",
       date_trunc('week', CAST("night_in_latest_year" AS TIMESTAMP)) AS "night_in_latest_year",
       ...
FROM "tmp"."forward_occupancy_forecast_superset_table"
WHERE "night_in_latest_year" >= '2021-03-11 00:00:00.000000'
  AND "night_in_latest_year" < '2021-06-15 00:00:00.000000'
       AND (dim_destination_geo IN
              (SELECT dim_destination_geo
               FROM
                 (SELECT dim_destination_geo,
                         sum(current_forward_nights_occupied)
                  FROM tmp.forward_occupancy_forecast_superset_table
                  WHERE is_most_recent_view = True
                  GROUP BY 1
                  order by 2 DESC
                  LIMIT 200))))
ORDER BY "Nights Backlog" DESC
LIMIT 50000;

Environment

latest master branch

Checklist

Make sure to follow these steps before submitting your issue - thank you!

Additional context

Add any other context about the problem here. cc @junlincc @zuzana-vej

junlincc commented 3 years ago

Hi @graceguo-supercat, can you attach a screenshot with all the control select. I assume you set filters and groud-by as well?

Couple PRs touched Pivot table lately. may need some help debugging. sorry about the regression. https://github.com/apache-superset/superset-ui/pull/954 https://github.com/apache/superset/pull/13057 (more likely?)

cc @villebro

villebro commented 3 years ago

I don't believe this is a regression - this has most likely always been the case. I agree that the proposal is the correct way to filter the data. I believe we should be able to do this by leveraging the timeseries_limit (although I'd really like to refactor this so that the feature isn't related to timeseries): https://github.com/apache/superset/blob/5fca19da565cae1e70a0dff66007828dbe9fc8ed/superset/connectors/sqla/models.py#L1177-L1224

graceguo-supercat commented 3 years ago

@junlincc this is controls list:

Screen Shot 2021-03-11 at 10 17 33 AM
john-bodley commented 3 years ago

@villebro are there any concerns of merely adding a SERIES LIMIT control per your suggestion?

john-bodley commented 3 years ago

Actually @villebro adding the SERIES LIMIT might not be the right approach, i.e., when prototyping the change the limit adheres to the row limit of the groupings of both the GROUP BY and COLUMNS which is likely not what the user is after. Grepping through the code it seems non trivial to decouple this logic.

Screen Shot 2021-06-28 at 10 37 14 AM