dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow

[Feature] Support 2-step aggregation for metrics #653

Open yu-iskw opened 1 year ago

yu-iskw commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

As we discussed on Slack, we found that derived metrics don't support aggregating other metrics. So it would be better to support aggregation operations in derived metrics.

Consider calculating, from access logs, the number of users per page-view count, where the page-view count is how many pages each user accesses within a certain duration. First, we calculate the number of page views per user. Then we calculate the number of users per calculated page-view count.

WITH page_views_by_user AS (
  SELECT
    user_id,
    COUNT(*) AS page_views
  FROM accesslog
  WHERE DATE(event_time) >= DATE("2023-01-01") -- The duration can be parameterized
  GROUP BY 1
)

SELECT
  page_views,
  COUNT(*) AS num_users
FROM page_views_by_user
GROUP BY 1
ORDER BY 1

------
page_views,num_users
1,xxx
2,xxx
3,xxx
4,xxx

Describe alternatives you've considered

We can create another dbt model to calculate the number of page views per user right now. But if we want to be able to change the duration, that becomes harder: we would have to pre-compute the number of page views per user at a finer grain, say per day, and then sum those up per user to get the total number of page views over the desired duration.

user,date,page_views
1,2023-01-01,2
1,2023-01-02,1
...

And there are other use cases where the preceding approach of creating another model doesn't work at all. Suppose we want to calculate the number of distinct item types purchased per user on an e-commerce service. We have to count distinct items per user. If we also want to dynamically change the duration with a condition, the pre-aggregation approach fails: we can't recover a distinct count per user by summing up data pre-aggregated over a finer period of time, say per day. So I think it would be worthwhile to support aggregations in derived metrics in the future.

WITH unique_items_per_user AS (
  SELECT
    user_id,
    COUNT(DISTINCT item_id) AS num_unique_items
  FROM orders
  WHERE DATE(purchased_at) >= DATE("2023-01-01") -- The duration can be parameterized
  GROUP BY 1
)

SELECT
  num_unique_items,
  COUNT(user_id) AS num_users
FROM unique_items_per_user
GROUP BY 1
ORDER BY 1

------
num_unique_items,num_users
1,xxx
2,xxx
3,xxx
4,xxx
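The non-additivity problem can be demonstrated with a small Python sketch over hypothetical per-day order rows: an additive metric like a row count sums up correctly across days, but a distinct count does not.

```python
# Hypothetical orders for one user: (date, item_id).
orders = [
    ("2023-01-01", "apple"),
    ("2023-01-01", "banana"),
    ("2023-01-02", "apple"),   # same item purchased again on another day
    ("2023-01-03", "cherry"),
]

# Additive metric: per-day row counts sum to the window total.
rows_per_day = {}
for day, _ in orders:
    rows_per_day[day] = rows_per_day.get(day, 0) + 1
assert sum(rows_per_day.values()) == len(orders)  # 4 == 4, pre-aggregation works

# Non-additive metric: per-day distinct counts do NOT sum to the
# distinct count over the full window.
distinct_per_day = {}
for day, item in orders:
    distinct_per_day.setdefault(day, set()).add(item)
sum_of_daily_distinct = sum(len(s) for s in distinct_per_day.values())  # 2 + 1 + 1 = 4
true_distinct = len({item for _, item in orders})                       # 3

print(sum_of_daily_distinct, true_distinct)  # 4 3 -- they disagree
```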

Who will this benefit?

This feature would enable data analysts to flexibly calculate metrics over metrics without creating additional models. And as described above, there are use cases, such as distinct counts, where the preceding pre-aggregation approach doesn't work. The feature would solve both of these main use cases.

Are you interested in contributing this feature?

yes

Anything else?

No response

Jstein77 commented 1 year ago

Thanks for opening this issue @yu-iskw! Adding another example where I think this feature would be useful.

I want to calculate a metric called same-store sales growth that takes revenue from this month and compares it to the same month last year, per store. I then want to take the average of same-store sales growth across all stores as my final metric. The SQL to calculate this metric would look something like:

with monthly_sales as (
    select
        a.month,
        a.store_id,
        sum(a.revenue) as revenue_current_month,
        sum(b.revenue) as revenue_prev_year,
        sum(a.revenue) / sum(b.revenue) as same_store_sales_growth
    from
        sales a
        left join sales b
            on a.store_id = b.store_id
            and date_add(year, -1, a.month) = b.month
    group by
        1, 2
)

select
    month,
    avg(same_store_sales_growth) as avg_same_store_sales_growth
from monthly_sales
group by
    month
We can use a derived metric offset to calculate same_store_sales_growth, but we can't calculate avg_same_store_sales_growth without being able to further aggregate the metric.
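The two-stage shape of this metric can be sketched in Python (hypothetical revenue figures, not MetricFlow code): step 1 derives a per-store growth ratio with a one-year offset, step 2 averages that derived metric across stores.

```python
# Hypothetical revenue keyed by (store, month); growth compares a month
# to the same month one year earlier.
revenue = {
    ("s1", "2023-01"): 110.0, ("s1", "2022-01"): 100.0,  # +10%
    ("s2", "2023-01"): 90.0,  ("s2", "2022-01"): 100.0,  # -10%
}

def prev_year(month: str) -> str:
    year, mm = month.split("-")
    return f"{int(year) - 1}-{mm}"

# Step 1: same-store sales growth per (store, month) -- expressible
# today as a derived metric with an offset.
growth = {
    (store, month): rev / revenue[(store, prev_year(month))]
    for (store, month), rev in revenue.items()
    if (store, prev_year(month)) in revenue
}

# Step 2: average the derived metric across stores -- the second
# aggregation step that derived metrics cannot express today.
avg_growth = sum(growth.values()) / len(growth)
print(avg_growth)  # (1.1 + 0.9) / 2 = 1.0
```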

Jstein77 commented 1 year ago

Adding another example from the community slack of how this feature could be used: Question RE derived metrics with the new specs:

We have connects/disconnects modelled as:

A metric gross_users_connected (simple metric using count_distinct)
A metric gross_users_disconnected (simple metric using count_distinct)
A metric net_users_connected (derived metric: gross_users_connected - gross_users_disconnected)

In this paradigm, gross counts can be aggregated on various time grains (day, week, month, quarter, ...) and they will be valid. The net numbers, however, should first be calculated at the day grain, not at coarser grains. This is because a user can connect on day 1 of the week, disconnect on day 2, and connect again on day 3. Assuming all this behaviour is from the same user, calculating at the day grain would result in:

day,gross_users_connected,gross_users_disconnected,net_users_connected
1,1,0,1
2,0,1,-1
3,1,0,1

which, if summed, would return:

week,net_users_connected
1,1

Whereas calculating on the week grain directly would give:

week,net_users_connected
1,0

Is there a way to account for this in the new MetricFlow spec or are there other practices we should follow here?

Tom pointed out that a way to accomplish this is to pre-compute connections and disconnections per day. This requires a second stage of aggregation.
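That grain sensitivity can be reproduced in a few lines of Python (illustrative data only): netting at the day grain and then summing disagrees with netting directly at the week grain.

```python
# (day, connects, disconnects) for one user within a single week.
daily = [(1, 1, 0), (2, 0, 1), (3, 1, 0)]

# Net computed at the day grain, then summed to the week: 1 - 1 + 1 = 1.
net_then_sum = sum(c - d for _, c, d in daily)

# Net computed directly at the week grain from distinct users: the same
# user both connected and disconnected during the week, so the net is 0.
users_connected = {"u1"}     # distinct users with a connect in the week
users_disconnected = {"u1"}  # distinct users with a disconnect in the week
net_at_week_grain = len(users_connected) - len(users_disconnected)

print(net_then_sum, net_at_week_grain)  # 1 0 -- the two grains disagree
```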

@nhandel @tlento we should do some more design work on what it would take to support 2-step aggregations. It seems like it would unlock a lot of metric types.

tlento commented 1 year ago

@Jstein77 agreed. It's not clear to me how much of this should be managed at the level of the derived metric (limited scope re-aggregation) vs at the level of the semantic model. If we have the ability to do derived semantic models that do a second-stage aggregation that might provide a more re-usable base for people to build more complex metrics. I feel like this is also related to the design discussions we were having around semantic models effectively keyed by metric_time.

nhandel commented 1 year ago

@Jstein77 Totally agree. Let's do some design work and post a potential solution. Also very open if anyone else wants to propose a solution.

Jstein77 commented 9 months ago

This use case is related to a limitation raised in the community slack:

It seems a measure can only be created on columns within the referenced data model. Can I create a measure that requires a join? E.g., I have stg_order_items and stg_products, and I need to join in product_price from the product table to calculate revenue. Can I express this join via a metric definition?

Jstein77 commented 9 months ago

Closing in favor of https://github.com/dbt-labs/metricflow/issues/1018