itestyoy opened 1 week ago
@itestyoy Thanks for the feedback here! Could you please elaborate on the use case? The Cube data model as you see it and the target SQL to be generated would be very helpful.
Hi, @paveltiunov!
Cube model:

```javascript
cube(`orders`, {
  sql_table: `dbt.orders`,
  data_source: `default`,

  dimensions: {
    o_orderstatus: {
      sql: `o_orderstatus`,
      type: `string`
    },
    o_orderpriority: {
      sql: `o_orderpriority`,
      type: `string`
    },
    o_clerk: {
      sql: `o_clerk`,
      type: `string`
    },
    o_clerk_country: {
      sql: `o_clerk_country`,
      type: `string`
    },
    o_orderdate: {
      sql: `o_orderdate`,
      type: `time`
    }
  },

  measures: {
    o_totalprice: {
      sql: `o_totalprice`,
      type: `sum`
    },
    clerk_count: {
      sql: `o_clerk`,
      type: `count_distinct`
    },
    cohort_clerk_count: {
      multi_stage: true,
      sql: `${clerk_count}`,
      type: `sum`,
      group_by: [o_clerk_country]
    }
  }
});
```
SQL API query:

```sql
SELECT
  orders.o_orderpriority,
  orders.o_clerk_country,
  MEASURE(orders.cohort_clerk_count),
  MEASURE(orders.clerk_count),
  MEASURE(orders.o_totalprice)
FROM orders
GROUP BY 1, 2
LIMIT 5000;
```
Each user has zero or more orders with different order priorities. This part of the generated SQL calculates the total sum of user counts for each country:
```sql
"orders__o_orderpriority",
"orders__o_clerk_country",
SUM(SUM("orders__clerk_count")) OVER (PARTITION BY "orders__o_clerk_country") AS "orders__cohort_clerk_count"
```
However, this is not an accurate user count: `count_distinct` doesn't help here, because it computes distinct counts over the already grouped data. We can safely sum user counts across countries, since each user belongs to exactly one country, but not across priorities, where the same user may appear more than once.
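To make the overcount concrete, here is a small JavaScript sketch (toy data, not Cube code) that mimics what `SUM(SUM("orders__clerk_count")) OVER (PARTITION BY country)` does to per-priority distinct counts:

```javascript
// Toy dataset: a clerk can appear under several priorities but only one country.
const orders = [
  { clerk: "A", country: "US", priority: "1-URGENT" },
  { clerk: "A", country: "US", priority: "2-HIGH" },
  { clerk: "B", country: "US", priority: "1-URGENT" },
];

// Distinct clerk counts per (priority, country), as cte_0 computes them.
const byGroup = {};
for (const o of orders) {
  const key = `${o.priority}|${o.country}`;
  (byGroup[key] ??= new Set()).add(o.clerk);
}

// The window function then sums these per-priority counts within each country.
const summed = {};
for (const [key, clerks] of Object.entries(byGroup)) {
  const country = key.split("|")[1];
  summed[country] = (summed[country] ?? 0) + clerks.size;
}

// The true distinct clerk count per country, for comparison.
const trueCount = {};
for (const o of orders) (trueCount[o.country] ??= new Set()).add(o.clerk);

console.log(summed.US);         // 3: clerk "A" is counted under both priorities
console.log(trueCount.US.size); // 2: the accurate figure
```

Summing per-priority distinct counts overstates the cohort size whenever a user spans several priorities, which is exactly why the join below has to drop `o_orderpriority`.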
If we exclude `o_orderpriority` from `cte_0` and `cte_1` (and from their `WHERE` clauses, if present), and make the `INNER JOIN` only by country, we can calculate the real user count in the cohort. Then we can calculate metrics like ARPU, retention rate (RR), and others.
Currently generated SQL:
```sql
WITH cte_0 AS (
  SELECT
    "orders".o_orderpriority "orders__o_orderpriority",
    "orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "orders".o_clerk) "orders__clerk_count"
  FROM dbt.orders AS "orders"
  GROUP BY 1, 2
  ORDER BY 3 DESC
),
cte_1 AS (
  SELECT
    "orders__o_orderpriority" "orders__o_orderpriority",
    "orders__o_clerk_country" "orders__o_clerk_country",
    sum(sum("orders__clerk_count")) OVER (PARTITION BY "orders__o_clerk_country") "orders__cohort_clerk_count"
  FROM (
    SELECT
      q_0."orders__o_orderpriority",
      q_0."orders__o_clerk_country",
      "orders__clerk_count" "orders__clerk_count"
    FROM cte_0 AS q_0
    ORDER BY 3 DESC
  ) AS cte_1_join
  GROUP BY 1, 2
  ORDER BY 3 DESC
)
SELECT
  q_0."orders__o_orderpriority",
  q_0."orders__o_clerk_country",
  "orders__cohort_clerk_count" "orders__cohort_clerk_count",
  "orders__clerk_count" "orders__clerk_count",
  "orders__o_totalprice" "orders__o_totalprice"
FROM (
  SELECT
    "main__orders".o_orderpriority "orders__o_orderpriority",
    "main__orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "main__orders".o_clerk) "orders__clerk_count",
    sum("main__orders".o_totalprice) "orders__o_totalprice"
  FROM dbt.orders AS "main__orders"
  GROUP BY 1, 2
) AS q_0
INNER JOIN (
  SELECT * FROM cte_1
) AS q_1 ON (
  q_0."orders__o_orderpriority" = q_1."orders__o_orderpriority"
  OR (
    q_0."orders__o_orderpriority" IS NULL
    AND q_1."orders__o_orderpriority" IS NULL
  )
)
AND (
  q_0."orders__o_clerk_country" = q_1."orders__o_clerk_country"
  OR (
    q_0."orders__o_clerk_country" IS NULL
    AND q_1."orders__o_clerk_country" IS NULL
  )
)
ORDER BY 3 DESC
LIMIT 5000
```
Suggested SQL:

```sql
WITH cte_0 AS (
  SELECT
    "orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "orders".o_clerk) "orders__clerk_count"
  FROM dbt.orders AS "orders"
  GROUP BY 1
  ORDER BY 2 DESC
),
cte_1 AS (
  SELECT
    "orders__o_clerk_country" "orders__o_clerk_country",
    sum(sum("orders__clerk_count")) OVER (PARTITION BY "orders__o_clerk_country") "orders__cohort_clerk_count"
  FROM (
    SELECT
      q_0."orders__o_clerk_country",
      "orders__clerk_count" "orders__clerk_count"
    FROM cte_0 AS q_0
    ORDER BY 2 DESC
  ) AS cte_1_join
  GROUP BY 1
  ORDER BY 2 DESC
)
SELECT
  q_0."orders__o_orderpriority",
  q_0."orders__o_clerk_country",
  "orders__cohort_clerk_count" "orders__cohort_clerk_count",
  "orders__clerk_count" "orders__clerk_count",
  "orders__o_totalprice" "orders__o_totalprice"
FROM (
  SELECT
    "main__orders".o_orderpriority "orders__o_orderpriority",
    "main__orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "main__orders".o_clerk) "orders__clerk_count",
    sum("main__orders".o_totalprice) "orders__o_totalprice"
  FROM dbt.orders AS "main__orders"
  GROUP BY 1, 2
) AS q_0
INNER JOIN (
  SELECT * FROM cte_1
) AS q_1 ON (
  q_0."orders__o_clerk_country" = q_1."orders__o_clerk_country"
  OR (
    q_0."orders__o_clerk_country" IS NULL
    AND q_1."orders__o_clerk_country" IS NULL
  )
)
ORDER BY 3 DESC
LIMIT 5000
```
Hi! Based on the Cube roadmap (https://github.com/cube-js/cube/issues/8486), we understand that you are currently working on post-aggregation measures and dimensions.
We found a guide on how to calculate cohort retention (https://cube.dev/docs/guides/recipes/analytics/cohort-retention). However, the method described there is not dynamic and consumes significant resources, because the joins it requires multiply the dataset several times.
Moreover, if we want to segment the data by dimensions other than date (e.g., device type, country), we would need to rewrite the main SQL query.
In our testing of post-aggregate measures, we noticed that properties like `reduce_by` and `group_by` could potentially solve this issue. However, if Cube applies all query filters to each post-aggregate measure, it can cause complications. We suggest adding a dedicated parameter that, when set, defines which filters are applied to a particular measure. This would enable truly dynamic cohort analysis.
For example, by adding a property like:
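A sketch of what this could look like in the measure definition above; the `apply_filters_only_for` name is hypothetical, not an existing Cube parameter:

```javascript
cohort_clerk_count: {
  multi_stage: true,
  sql: `${clerk_count}`,
  type: `sum`,
  group_by: [o_clerk_country],
  // Hypothetical property: only filters on the listed members would be
  // applied to this measure; filters on other dimensions (e.g.
  // o_orderpriority) would be ignored when computing the cohort size.
  apply_filters_only_for: [o_clerk_country]
}
```

With something like this, the cohort denominator would stay stable while the rest of the query is sliced by arbitrary dimensions.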