apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.34k stars 3.21k forks source link

surrogate key gerneration requested. #545

Closed lenmom closed 5 years ago

lenmom commented 5 years ago

scenario description: supose we have a situation in dim_sales_person(employee_id int , department int,...), and the fact table fact_sales(order_id int, sales_person_id, department_id int,...).

in case we calculate the detapartment sales amount based on the fact table fact_sales, if the sales person have changed department and we have no surrogate key gerneration mechanism,so all the sales metric for the sales person would be calculated to the persion's current department, and this is incorrect.

a common solution for this scenario is to generate a new record for sales person who has changed department, and generate new record and insert to the dim_sales_person table, and the fact table would reference the new generated record's primary key. and set the previous record expire time to the day when the person's department has changed. the table schema look as follows:

CREATE TABLE dim_sales_person ( sales_person_sk int , employee_id int , department varchar(128), . . . effective_date date, expiry_date date )

this is called slow changed dimention type 2, and we should record the changing attibute history.

In hive, it has already support such feture, and the solution looks like as follows:

-- 将有地址变化的插入到dim_customer表,如果有相同数据存在有不过期的数据则不插入 INSERT INTO dw_sales.dim_customer SELECT row_number() over (ORDER BY t1.customer_number) + t2.sk_max, t1.customer_number, t1.customer_name, t1.customer_street_address, t1.customer_zip_code, t1.customer_city, t1.customer_state, t1.version, t1.effective_date, t1.expiry_date FROM(SELECT t2.customer_number customer_number, t2.customer_name customer_name, t2.customer_street_address customer_street_address, t2.customer_zip_code, t2.customer_city, t2.customer_state, t1.version + 1 version, ${hivevar:pre_date} effective_date, ${hivevar:max_date} expiry_date FROM dw_sales.dim_customer t1 INNER JOIN ods_sales.ods_customer t2 ON t1.customer_number = t2.customer_number AND t1.expiry_date = ${hivevar:pre_date} LEFT JOIN dw_sales.dim_customer t3 ON t1.customer_number = t3.customer_number AND t3.expiry_date = ${hivevar:max_date} WHERE t1.customer_street_address <> t2.customer_street_address AND t3.customer_sk IS NULL ) t1 CROSS JOIN (SELECT COALESCE(MAX(customer_sk),0) sk_max FROM dw_sales.dim_customer) t2;

the key implementation is: row_number() over (ORDER BY t1.customer_number) + t2.sk_max

via such solution, hive can handle surrogate key gerneration.

In summery: because slow change dimention is an important feature in dataware house modeling, so it's suggested to support surrogate key gerneration in doris.

EmmyMiao87 commented 5 years ago

Doris does not support unpartition in analytic function. So your query of explain select 1+row_number() over (order by i_item_sk) from item in doris will throw error.

But surrogate key generation already has been support in doris. The query select 1+row_number() over (partition by i_category order by i_item_sk) from item; can be explain to query plan. You can try it.

Maybe you need to change issue named to Analytic function does not support UNPARTITION.

I will continue to find the reason of UNPARTITION has not been support in analytic function @lenmom

EmmyMiao87 commented 5 years ago

@lenmom Already remove the restrict on https://github.com/apache/incubator-doris/pull/659

EmmyMiao87 commented 5 years ago

@chenhao7253886 please close this issue ^0^