apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.97k stars 13.94k forks source link

Group results not displayed in an "others" category #579

Closed IljaKroonen closed 5 years ago

IljaKroonen commented 8 years ago

I would be interested in some functionality to group results that cannot be show in some "others" category (could be enabled with a checkbox on the slice).

This could be done at query level. Instead of doing:

SELECT TOP 10 distributor, sum(admissions) as admissions FROM TicketSales
GROUP BY distributor
ORDER BY sum(admissions) DESC;

Caravel could do:

SELECT *
FROM
(SELECT TOP 9 distributor, sum(admissions) as admissions FROM TicketSales
GROUP BY distributor ORDER BY sum(admissions) DESC
UNION
SELECT 'Others' as distributor, sum(admissions) as admissions FROM TicketSales
WHERE distributor NOT IN 
    (SELECT TOP 9 distributor FROM TicketSales
    GROUP BY distributor
    ORDER BY sum(admissions) DESC)) t
ORDER BY admissions DESC;

Resulting in the 9 distributors with the most tickets sold and an "others" row with the total of all tickets sold by other distributors.

Would this be achievable with the libraries currently used by caravel? I have been looking at the query method in models.py and this seems like a complex change.

mistercrunch commented 8 years ago

Yes that'd be very nice. Here's a better way to write this sql with 2 scans instead of 3:

SELECT COALESCE(b.distributor, 'Others'), sum(admissions) as admissions 
FROM TicketSales ts
LEFT JOIN (
    SELECT TOP 10 distributor FROM TicketSales
    GROUP BY distributor
    ORDER BY sum(admissions) DESC;
) as subq ON ts.distributor = subq.distributor
GROUP BY COALESCE(b.distributor, 'Others')
ORDER BY sum(admissions) DESC;

For the "Series Limit" feature we're already doing this essentially, except with an INNER JOIN. Showing "Others" there should be easy-ish. You have to change the join to a left join and do this COALESCE thing.

IljaKroonen commented 8 years ago

True, your query resolved to a better execution place at least in my database :)

We needed this feature quickly and because I don't really understand how caravel generates queries (I am not familiar with the libraries involved, especially around druid), we decided to implement it directly with pandas in viz.py (post query). If this interests you I can put the pull request online once we're done, but IMHO query level would be better to do this. FYI, it looks like this in all visualizations that have a LIMIT option.

image

guohan commented 7 years ago

@IljaKroonen Hi, may i ask a question about the 'others category' feature that why cannot find the relational code in superset object? Is it not supported yet :-(