dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow

[SL-2184] Conversion metrics do not apply filters to input conversion measures #1199

Open tlento opened 6 months ago

tlento commented 6 months ago

While working through enabling predicate pushdown, I discovered that conversion metrics only apply filters to the base measure, not to the conversion measure.

For example, if I have this:

metrics:
  - name: legal_click_conversions
    description: Click through rate, excluding accounts with a > 20% likelihood of being fraudulent
    type: conversion
    type_params:
      conversion_type_params:
        entity: account
        base_measure:
          name: impressions
          filter: "{{ Dimension('account__fraud_score') }} <= 0.2"
        conversion_measure:
          name: clicks
          filter: "{{ Dimension('account__fraud_score') }} <= 0.2"
        window: 3 days

then, as a user, I would reasonably expect both the base and conversion measures to throw away any input rows where account__fraud_score was > 0.2.

What we do in practice is apply the filter to the base measure, but throw it away when configuring the conversion measure, which means we will overstate the rate by some arbitrary amount. Just how much depends on how the filter skews the data.

Since filters like these are often applied specifically to deal with non-representative subsets of data, we might be grossly overstating conversions in this case.
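
To make the actual behavior concrete, the example above is evaluated roughly as if it had been written with the conversion-side filter removed (same definition as before, shown here only for illustration):

metrics:
  - name: legal_click_conversions
    type: conversion
    type_params:
      conversion_type_params:
        entity: account
        base_measure:
          name: impressions
          filter: "{{ Dimension('account__fraud_score') }} <= 0.2"
        conversion_measure:
          name: clicks   # the filter configured here is silently discarded
        window: 3 days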

This is almost certainly done to avoid applying a pre-aggregation time filter to conversions. In this example we need to include the 3 days of conversion events following each base event, so if a time filter is applied directly to the conversion input we may lose conversions that land just outside the filtered range (for example, an impression on the last day of the range whose click arrives two days later). However, what we do today is a blanket obliteration of the filter predicates from both the query input and the metric definition, and this can produce incorrect results.

Currently, users can work around this sort of filter issue by making a categorical dimension and marking it as a constant property, but this is, of course, fairly clumsy, and the hidden gotcha of throwing away conversion measure filters seems undesirable.
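
For reference, a minimal sketch of that workaround, assuming a hypothetical categorical dimension is_likely_fraud (e.g., derived from fraud_score > 0.2) that is available to both the impressions and clicks inputs; the names and the property references under constant_properties are illustrative and depend on how the dimension is modeled on each side:

metrics:
  - name: legal_click_conversions
    type: conversion
    type_params:
      conversion_type_params:
        entity: account
        base_measure:
          name: impressions
          filter: "{{ Dimension('account__is_likely_fraud') }} = false"
        conversion_measure:
          name: clicks
        window: 3 days
        constant_properties:
          # base and conversion rows only join when the flag matches, so the
          # base-side filter effectively constrains the conversion side too
          - base_property: is_likely_fraud
            conversion_property: is_likely_fraud

This keeps the conversion side consistent with the filtered base side, at the cost of maintaining an extra dimension and remembering to wire it into every conversion metric.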

Where things get tricky, however, is figuring out what to do about a categorical dimension filter passed down from outside. If a user applies the account__fraud_score filter at query time, where should it be applied? And what if they write a single filter expression like "{{ Dimension('account__fraud_score') }} <= 0.2 AND {{ Dimension('metric_time') }} BETWEEN ...."?

We'll likely need to do the full base -> conversion input join with all filter elements treated as constant properties, apply the filter against that joined output, and then do the final aggregations as appropriate.

SL-2184

tlento commented 6 months ago

Worth noting: it's entirely possible that applying filters to just the base measure is adequate, because they're applied after the conversion join and before aggregation, so this might simply be a predicate pushdown issue.

This seems unlikely, though, since the conversion metric join is applied to the filtered and aggregated output, which means the filters only get applied correctly if the filter elements (in the example above, account__fraud_score) are included in constant_properties.