dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow
Other
1.14k stars 94 forks source link

[SL-2585] Time filters are not applied to the time spine table when join to timespine = true #1489

Open Jstein77 opened 6 days ago

Jstein77 commented 6 days ago

2 bugs here:

  1. When you filter on a metric_time that's not in the group by, it won't be reapplied. This bug I have found.
    1. Planned fix: Apply any metric_time/agg_time filter constraints to the time spine table before joining to the source table. This is necessary because the column to be filtered won't be available after aggregating. Note that this will need to filter any grain, so we might choose a different time spine table for the query if there is a filter on a smaller grain than what is in the group by.
  2. When you filter on a metric_time that IS in the group by, it won't be reapplied. I'm hoping my fix for the prior will also fix this. TODO: try to repro! Do this before the fix on the other one to make sure you don't duplicate work.

🟡 Time filters are not being applied to the time spine table when join to timespine = true. cc @courtney.holcomb @tomkit.lento https://dbt-labs.slack.com/archives/C06UEFM66P5/p17196079781798191.

[June 28th, 2024 1:52 PM] ben.kramer: Friday Afternoon Question - Not Urgent! If i want to produce a saved query grouped by metric_time, is there a way to limit the results to a subset of days when join_to_time_spine is true?

I added filters to my where clause but those filters apply within the subquery (see thread)

From SyncLinear.com | SL-2585

Jstein77 commented 6 days ago

Adding more context to the thread. I created a saved query in jaffle-sl-template/timespine-debug.

The saved query config is below. You can run the following query to see this behavior dbt sl query --saved-query order_cost_time_spine_test

 - name: order_cost_time_spine_test
    query_params:
      metrics: 
        - revenue
      group_by:
        - TimeDimension('metric_time', 'day')
      where:
        # -  "{{TimeDimension('metric_time', 'day')}} >= 
        #     CASE
        #         WHEN EXTRACT(day FROM CURRENT_DATE AT TIME ZONE 'America/New_York') > 14 THEN 
        #             DATEADD(month, 1, DATE_TRUNC(month, CURRENT_DATE AT TIME ZONE 'America/New_York'))
        #         WHEN EXTRACT(day FROM CURRENT_DATE AT TIME ZONE 'America/New_York') <= 14 THEN 
        #             DATE_TRUNC(month, CURRENT_DATE AT TIME ZONE 'America/New_York')
        #         ELSE NULL 
        #     END"
        # -  "{{TimeDimension('order_item__ordered_at_test', 'month')}}>= '2020-01-01'"
        -  "{{TimeDimension('metric_time', 'month')}} = '2020-01-01'"
Jstein77 commented 6 days ago

Let's see what we find when Diego is back and then make a call, but we might be able to at least get this to a point where they can work around it within a week or two.

Jstein77 commented 6 days ago

Oh wait, this doesn't require predicate pushdown (as long as the metric time grains match between filter and group by). We just have to fully support the list interface, which is separate but related to predicate pushdown. If predicate pushdown worked I'd push that up next but I'm a little reluctant to make the change first and then try to debug.

Jstein77 commented 6 days ago

That one is actually the problem - their filter is singular even though they've split it up.

With a more complete predicate pushdown implementation we could push it down, then do the filter + join, then re-apply the other filters (although it's difficult). But predicate pushdown apparently doesn't work.

Jstein77 commented 6 days ago

It also doesn't correctly filter the time spine when I have a filter on another time dimension other than metric time at a monthly grain, and a filter on metric time at a monthly grain. For example this saved query yields the following SQL

I would expect the metric time filter to still be applied to the outer subquery, and the other time filter to be in the inner subquery unless it's included in the group by

Jstein77 commented 6 days ago

Yea agreed that that one doesn't make sense if it's not in the grouping set.

Jstein77 commented 6 days ago

If we focus on this as the filter:

{{TimeDimension('points_ledger_transaction_id__rent_cycle_month', 'month')}} >= whatever

I don't know what they expect out of a filter like this + a time spine fill nulls join operation, which is where the extra rows (all of which have NULL for that time dimension value) are coming from. Essentially, applying that filter after a time spine join might undo the fill nulls they've requested. If they group by the column, we know they expect us to discard rows that "miss" the join to the filtered set.

Jstein77 commented 6 days ago

It also doesn't correctly filter the time spine when I have a filter on another time dimension other than metric time at a monthly grain, and a filter on metric time at a monthly grain. For example this saved query yields the following SQL

What is correct here?

The issue Ben is running into is basically unresolvable - he's adding a query-time filter on a dimension that isn't in the grouping set - and it's not clear what the user's intent is in the first place with the fill nulls with and time spine join setting.

Jstein77 commented 6 days ago

I'm actually not sure why the second query didn't filter the time spine correctly 🤔 since it looks like they are querying everything at a monthly grain.

Jstein77 commented 6 days ago

If you look at Ben's query he's actually using the default grain for metric time, its the other dimension points_ledger_transaction_id__rent_cycle_month that's causing the issue here. He won't be able to query that at a daily grain since the minimum grain is month. If we fix filtering when metric time and another time dimension at a diffrent grain are included in the query it should resolve that issue for him.

 - name: rent_points_earned_daily_cumulative_in_rent_cycle
    description: "Amount of points earned on Rent by day, cumulative"
    label: Cumulative Rent Points Earned by Day Within Current Rent Cycle
    query_params:
      metrics:
        - cumulative_earned_point_amount
      group_by: 
        - TimeDimension('metric_time')
      where: 
        - "{{TimeDimension('points_ledger_transaction_id__rent_cycle_month', 'month')}} >= 
            case
              when extract(day from current_date('America/New_York')) > 14 then date_add(date_trunc(current_date('America/New_York'), month), interval 1 month)
              when extract(day from current_date('America/New_York')) <= 14 then date_trunc(current_date('America/New_York'),  month)
            else null 
            end"
        - "{{TimeDimension('metric_time')}}<= current_date('America/New_York')"
        - "{{TimeDimension('metric_time')}}<= 
            case
              when extract(day from current_date('America/New_York')) > 14 then date_add(date_trunc(current_date('America/New_York'), month), interval 15 day)
              when extract(day from current_date('America/New_York')) <= 14 then date_add(date_add(date_trunc(current_date('America/New_York'),  month), interval -1 month), interval 15 day)
            else null 
            end"
Jstein77 commented 6 days ago

I feel like we should fix this, it feels a bit broken to not be able to filter the output of a cumulative metric or join to timespine metric by month or another grain that's not included in your query.

Jstein77 commented 6 days ago

It also doesn't correctly filter the time spine when I have a filter on another time dimension other than metric time at a monthly grain, and a filter on metric time at a monthly grain. For example this saved query yields the following SQL

 query_params:
      metrics: 
        - revenue
      group_by:
        - TimeDimension('metric_time', 'day')
      where:
        # -  "{{TimeDimension('metric_time', 'day')}} >= 
        #     CASE
        #         WHEN EXTRACT(day FROM CURRENT_DATE AT TIME ZONE 'America/New_York') > 14 THEN 
        #             DATEADD(month, 1, DATE_TRUNC(month, CURRENT_DATE AT TIME ZONE 'America/New_York'))
        #         WHEN EXTRACT(day FROM CURRENT_DATE AT TIME ZONE 'America/New_York') <= 14 THEN 
        #             DATE_TRUNC(month, CURRENT_DATE AT TIME ZONE 'America/New_York')
        #         ELSE NULL 
        #     END"
        -  "{{TimeDimension('order_item__ordered_at_test', 'month')}}>= '2020-01-01'"
        -  "{{TimeDimension('metric_time', 'day')}}>= '2020-01-01'"

Compiled SQL

SELECT
  metric_time__day
  , COALESCE(revenue, 0) AS revenue
FROM (
  SELECT
    subq_7.date_day AS metric_time__day
    , subq_5.revenue AS revenue
  FROM ANALYTICS.dbt_jstein.metricflow_time_spine subq_7
  LEFT OUTER JOIN (
    SELECT
      metric_time__day
      , SUM(revenue) AS revenue
    FROM (
      SELECT
        DATE_TRUNC('month', cast(ordered_at as DATETIME)) AS order_item__ordered_at_test__month
        , DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
        , product_price AS revenue
      FROM ANALYTICS.dbt_jstein.order_items order_item_src_10000
    ) subq_2
    WHERE ( order_item__ordered_at_test__month>= '2020-01-01' ) AND ( metric_time__day>= '2020-01-01' )
    GROUP BY
      metric_time__day
  ) subq_5
  ON
    subq_7.date_day = subq_5.metric_time__day
) subq_8
LIMIT 100
Jstein77 commented 6 days ago

We can do anything, it's a question of whether or not it's worth the effort.

Jstein77 commented 6 days ago

Yeah actually in looking at this further I don't think we can just update the conditional. We'd have to sub-query the filter around the coarser grain and special case the dataflow plan layout around these time granularities. That's kind of nasty. It can be done, but it's super annoying.

The other thing they can do is not use metric_time__month in a filter when grouping by metric_time__day. It's a little more verbose in the filter expression but that seems fine.

Or it would be fine if they didn't have to do {{ TimeDimension('metric_time') }} every time they want to refer to metric_time.

Jstein77 commented 6 days ago

Could we also pull through the monthly grain in the outer query so we're able to add the filter?

SELECT
  metric_time__day
  , COALESCE(revenue, 0) AS revenue
FROM (
  SELECT
    subq_7.date_day AS metric_time__day
    , subq_5.revenue AS revenue
  FROM ANALYTICS.dbt_jstein.metricflow_time_spine subq_7
  LEFT OUTER JOIN (
    SELECT
      metric_time__day
      , metric_time__month
      , SUM(revenue) AS revenue
    FROM (
      SELECT
        DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
        , DATE_TRUNC('month', cast(ordered_at as DATETIME)) AS metric_time__month
        , product_price AS revenue
      FROM ANALYTICS.dbt_jstein.order_items order_item_src_10000
    ) subq_2
    GROUP BY
      metric_time__day,
      metric_time__month
  ) subq_5
  ON
    subq_7.date_day = subq_5.metric_time__day
 WHERE  metric_time__month>= '2020-01-01'
) subq_8
LIMIT 100
Jstein77 commented 6 days ago

It makes sense when you look at the compiled SQL, because we don't select metric_time__month in the out query, so the only place to apply the grain is the inner query.

SELECT
  metric_time__day
  , COALESCE(revenue, 0) AS revenue
FROM (
  SELECT
    subq_7.date_day AS metric_time__day
    , subq_5.revenue AS revenue
  FROM ANALYTICS.dbt_jstein.metricflow_time_spine subq_7
  LEFT OUTER JOIN (
    SELECT
      metric_time__day
      , SUM(revenue) AS revenue
    FROM (
      SELECT
        DATE_TRUNC('day', cast(ordered_at as DATETIME)) AS metric_time__day
        , DATE_TRUNC('month', cast(ordered_at as DATETIME)) AS metric_time__month
        , product_price AS revenue
      FROM ANALYTICS.dbt_jstein.order_items order_item_src_10000
    ) subq_2
    WHERE metric_time__month>= '2020-01-01'
    GROUP BY
      metric_time__day
  ) subq_5
  ON
    subq_7.date_day = subq_5.metric_time__day
) subq_8
LIMIT 100
LIMIT 100
Jstein77 commented 6 days ago

Just edited it 😏. It the first example he sent they are not, they are group by metric time for a cumulative metric, and filtering by a monthly metric.

Jstein77 commented 6 days ago

Uh, those are the same queries. But the issue makes sense to me now. The work-around for now is to also group by the metric_time at the same grain as in the query filter. We can refine the logic for whether or not to re-apply the filter.

Jstein77 commented 6 days ago

Wait. Are they grouping by the same grain?

Jstein77 commented 6 days ago

This saved query will not apply the filter to the time spine table

- name: order_cost_time_spine_test
    query_params:
      metrics: 
        - revenue
      group_by:
        - TimeDimension('metric_time', 'day')
      where:
        -  "{{TimeDimension('metric_time', 'month')}}>= '2020-01-01'"

This saved query will

- name: order_cost_time_spine_test
    query_params:
      metrics: 
        - revenue
      group_by:
        - TimeDimension('metric_time', 'day')
      where:
        -  "{{TimeDimension('metric_time', 'day')}}>= '2020-01-01'"