cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev
Other
17.9k stars 1.77k forks source link

[Feature request] Post-aggregate measure filters #8804

Open itestyoy opened 1 week ago

itestyoy commented 1 week ago

Hi! Based on the Cube roadmap (https://github.com/cube-js/cube/issues/8486), we understand that you are currently working on post-aggregate metrics and dimensions.

We found a guide on how to calculate cohort retention (https://cube.dev/docs/guides/recipes/analytics/cohort-retention). However, the method described there is not dynamic and consumes significant resources due to the need for joins that multiply the dataset several times.

Moreover, if we want to segment the data by a dimensions other than date (e.g., device type, date, country, ...), we would need to rewrite the main SQL query.

In our testing of post-aggregate measures, we noticed that properties like reduce_by and group_by could potentially solve this issue. However, if Cube applies all filters to each post-aggregate measure, it can cause complications.

We suggest adding a specific parameter that allows defining which filters are applied to a particular measure if parameter is set. This would enable truly dynamic cohort analysis.

For example, by adding a property like:

propagate_filters: [dim1, dim2, ...]
paveltiunov commented 2 days ago

@itestyoy Thanks for the feedback here! Could you please elaborate more the use case? Cube data model as you see it and target SQL to be generated would be very helpful.

itestyoy commented 2 days ago

Hi, @paveltiunov!

Cube model:

cube(`orders`, {
  sql_table: `dbt.orders`,

  data_source: `default`,

  dimensions: {
    o_orderstatus: {
      sql: `o_orderstatus`,
      type: `string`
    },

    o_orderpriority: {
      sql: `o_orderpriority`,
      type: `string`
    },

    o_clerk: {
      sql: `o_clerk`,
      type: `string`
    },

    o_clerk_country: {
      sql: `o_clerk_country`,
      type: `string`
    },

    o_orderdate: {
      sql: `o_orderdate`,
      type: `time`
    }
  },

  measures: {
    o_totalprice: {
      sql: `o_totalprice`,
      type: `sum`
    },

    clerk_count: {
      sql: `o_clerk`,
      type: `count_distinct`
    },

    cohort_clerk_count: {
      multi_stage: true,
      sql: `${clerk_count}`,
      type: `sum`,
      group_by: [o_clerk_country]
    }
  },

});

SQL API Query:

SELECT
  orders.o_orderpriority,
  orders.o_clerk_country,
  MEASURE(orders.cohort_clerk_count),
  MEASURE(orders.clerk_count),
  MEASURE(orders.o_totalprice)
FROM
  orders
GROUP BY
  1,
  2
LIMIT
  5000;

Each user has 0 or more orders with different order priorities. This part of the code will calculate the total sum of users for each country:

Copy code
"orders__o_orderpriority", 
"orders__o_clerk_country", 
SUM(SUM("orders__clerk_count")) OVER (PARTITION BY "orders__o_clerk_country") AS "orders__cohort_clerk_count"

However, this is not an accurate user count (the count_distinct function doesn’t help because it calculates distinct counts over grouped data). We can sum the users by country since each user can belong to only one country but not over priority. If we exclude o_orderpriority from cte_0 and cte_1 and from WHERE clause if presented in cte_0 and cte_1, and make an INNER JOIN only by country, we can calculate the real user count in the cohort.

Then we can calculate metrics like ARPU, RR and others.

WITH cte_0 AS (
  SELECT
    "orders".o_orderpriority "orders__o_orderpriority",
    "orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "orders".o_clerk) "orders__clerk_count"
  FROM
    dbt.orders AS "orders"
  GROUP BY
    1,
    2
  ORDER BY
    3 DESC
),
cte_1 AS (
  SELECT
    "orders__o_orderpriority" "orders__o_orderpriority",
    "orders__o_clerk_country" "orders__o_clerk_country",
    sum(sum("orders__clerk_count")) OVER(PARTITION BY "orders__o_clerk_country") "orders__cohort_clerk_count"
  FROM
    (
      SELECT
        q_0."orders__o_orderpriority",
        q_0."orders__o_clerk_country",
        "orders__clerk_count" "orders__clerk_count"
      FROM
        cte_0 as q_0
      ORDER BY
        3 DESC
    ) AS cte_1_join
  GROUP BY
    1,
    2
  ORDER BY
    3 DESC
)
SELECT
  q_0."orders__o_orderpriority",
  q_0."orders__o_clerk_country",
  "orders__cohort_clerk_count" "orders__cohort_clerk_count",
  "orders__clerk_count" "orders__clerk_count",
  "orders__o_totalprice" "orders__o_totalprice"
FROM
  (
    SELECT
      "main__orders".o_orderpriority "orders__o_orderpriority",
      "main__orders".o_clerk_country "orders__o_clerk_country",
      count(distinct "main__orders".o_clerk) "orders__clerk_count",
      sum("main__orders".o_totalprice) "orders__o_totalprice"
    FROM
      dbt.orders AS "main__orders"
    GROUP BY
      1,
      2
  ) as q_0
  INNER JOIN (
    SELECT
      *
    FROM
      cte_1
  ) as q_1 ON (
    q_0."orders__o_orderpriority" = q_1."orders__o_orderpriority"
    OR (
      q_0."orders__o_orderpriority" IS NULL
      AND q_1."orders__o_orderpriority" IS NULL
    )
  )
  AND (
    q_0."orders__o_clerk_country" = q_1."orders__o_clerk_country"
    OR (
      q_0."orders__o_clerk_country" IS NULL
      AND q_1."orders__o_clerk_country" IS NULL
    )
  )
ORDER BY
  3 DESC
LIMIT
  5000

Suggested:

WITH cte_0 AS (
  SELECT
    "orders".o_clerk_country "orders__o_clerk_country",
    count(distinct "orders".o_clerk) "orders__clerk_count"
  FROM
    dbt.orders AS "orders"
  GROUP BY
    1
  ORDER BY
    2 DESC
),
cte_1 AS (
  SELECT
    "orders__o_clerk_country" "orders__o_clerk_country",
    sum(sum("orders__clerk_count")) OVER(PARTITION BY "orders__o_clerk_country") "orders__cohort_clerk_count"
  FROM
    (
      SELECT
        q_0."orders__o_clerk_country",
        "orders__clerk_count" "orders__clerk_count"
      FROM
        cte_0 as q_0
      ORDER BY
        2 DESC
    ) AS cte_1_join
  GROUP BY
    1
  ORDER BY
    2 DESC
)
SELECT
  q_0."orders__o_orderpriority",
  q_0."orders__o_clerk_country",
  "orders__cohort_clerk_count" "orders__cohort_clerk_count",
  "orders__clerk_count" "orders__clerk_count",
  "orders__o_totalprice" "orders__o_totalprice"
FROM
  (
    SELECT
      "main__orders".o_orderpriority "orders__o_orderpriority",
      "main__orders".o_clerk_country "orders__o_clerk_country",
      count(distinct "main__orders".o_clerk) "orders__clerk_count",
      sum("main__orders".o_totalprice) "orders__o_totalprice"
    FROM
      dbt.orders AS "main__orders"
    GROUP BY
      1,
      2
  ) as q_0
  INNER JOIN (
    SELECT
      *
    FROM
      cte_1
  ) as q_1 ON (
    q_0."orders__o_clerk_country" = q_1."orders__o_clerk_country"
    OR (
      q_0."orders__o_clerk_country" IS NULL
      AND q_1."orders__o_clerk_country" IS NULL
    )
  )
ORDER BY
  3 DESC
LIMIT
  5000