apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.34k stars 3.21k forks source link

[Proposal] Support approximate TopN function for Doris #4674

Open Youngwb opened 3 years ago

Youngwb commented 3 years ago

Is your feature request related to a problem? Please describe. It is a common scenario for a database to calculate top-n from a dataset, sort a column and return the Top elements. TopN could uses an approximation algorithms to quickly return results with little overhead.

There are many algorithms to quickly calculate frequent items from the data set, among which the SpaceSaving algorithm has a good performance for reference [1]. Kylin and PostgresSQL have implemented TopN approximation algorithms based on this algorithm and have performed well. We can implement this algorithm in Doris to compute TopN quickly.

Describe the solution you'd like Kylin's current implementation computes the topN of a measure column based on dimensional columns, such like group by dimension_cols order by measure_column limit 10 reference while PostgresSQL uses the topN function to computes the frequent entries of a column. reference

For the current Doris, the lack of UDTF support makes it difficult to calculate topN on dimension columns and measure column like Kylin. So it is more appropriate to calculate topN using an aggregate function like PostgresSQL, and present the results in JSON format.

eg:

CREATE TABLE popular_ad
(
  event_date date NOT NULL ,
  ad_id BIGINT NOY NULL
);

For example, we can calculate the ad_id that is displayed or clicked most per day.

select event_date, topn(ad_id, 1)  as top_1 from popular_ad group by event_date;
               event_date         |      top_1
----------------------------------+--------------------
2020-01-01                        |  {"xxxx":"10000"}
2020-01-02                        |  {"xxx":"11111"}

The result represents the top1 item and its corresponding frequency.

In the future, with Doris supporting UDTF, we can display item and frequency based on this function, or topN for the SUM aggregate type column group by key columns

Additional context [1] Efficient Computation of Frequent and Top-k Elements in Data Streams by Metwally, Agrawal, and Abbadi.

morningman commented 3 years ago

Hi @Youngwb Good proposal. Could you please send this proposal as a email to dev@doris.apache.org? That could help Doris be more in Apache Way.