dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow
Other
1.14k stars 94 forks source link

[SL-2777] Reproduce error with conversion metric and missing group by #1492

Open Jstein77 opened 1 day ago

Jstein77 commented 1 day ago

@courtney.holcomb said:

We've been seeing this error come up in the logs a lot lately, and I was finally able to repro. It's triggered by querying a conversion metric with a where filter but with no group by. The where filter is only applied to one of the measures, so then we aren't able to join the two aggregated measures. I'm not super familiar with conversion metric behavior. What do we want the experience to be in that scenario to fix this bug? cc @willymwai.deng @jordan.stein

[August 21st, 2024 11:30 AM] diego.fernandez: Hey all, we've seen this error a few times in MFS recently: All parent nodes should have the same set of linkable instances since all values are coalesced. They all seem to come from the same account, so maybe it's a configuration error on their side, but it shows up as errors on our side <https://dbtlabsmt.datadoghq.com/logs?query=%40extra.job.error.message%3A"All parent nodes should have the same set of linkable instances since all values are coalesced."&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&context_event=AZFy6QYvAAC7NuVb965U8ACb&fromUser=false&graphType=flamegraph&historicalData=true&hostGroup=*&lens=Performance&messageDisplay=inline&page=1&panelFrom=1724209200000&panelStorageType=hot&panelTo=1724210400...

From SyncLinear.com | SL-2777

Jstein77 commented 1 day ago

As a follow up:

Jstein77 commented 1 day ago

we do not have anything that tracks what type of metric is being used. We could see if it gets logged so that we can make that calculation on DD

Jstein77 commented 1 day ago

Do we have any visibility into how often conversion metrics are being queried, and what % hit this error? I checked in our snowflake tables but didn't see anything. I'm trying to prioritize this bug.

Jstein77 commented 1 day ago

Commented on the PR. I don't know if we can quick-fix this, but if you've got ideas I'd be happy to be wrong about that.

Jstein77 commented 1 day ago

That error isn't happening in the pushdown optimizer, it's happening on plan conversion. A query of this form apparently doesn't convert to a sql query plan:

    parsed_query = query_parser.parse_and_validate_query(
        metric_names=("visit_buy_conversion_rate_7days",),
        group_by_names=("metric_time", "user__home_state_latest"),
        where_constraint=PydanticWhereFilter(where_sql_template="{{ Dimension('visit__referrer_id') }} = '123456'"),
    )
Jstein77 commented 1 day ago

Hmm, I made that change and it seems to break the predicate pushdown tests, but other tests seems to be fine

FAILED tests_metricflow/query_rendering/test_predicate_pushdown_rendering.py::test_conversion_metric_query_filters - RuntimeError: Expected exactly one matching instance for LinklessEntitySpec(element_name='user', entity_links=()) in instance set, but found: []. All entity instances: ()

@tomkit.lento mind lending a hand? 👀

Jstein77 commented 1 day ago

This only addresses the bug from Courtney's post and also the issue 1210, but 1199 is a separate issue

Jstein77 commented 1 day ago

Cause this base_measure_recipe.required_local_linkable_specs contains the specs needed to do the joins to get the converted events, but it contains the specs from the where filter too since it was built via __get_required_and_extraneous_linkable_specs and in the final aggregate measures node, we should just use the queried_specs as we don't want any of the specs from the filter to be there which is causing that bug

Jstein77 commented 1 day ago

So trying to remember the context of all this again, but it seems like you can repro this by filtering with a dim that exists in the base measure's semantic model, but that dimension isn't in the group by? I'm looking at the error, it seems like that error is coming from when we cross join the agg'd base measure set and the base measure set filtered by only converted where the base measure set filtered by converted rows seems to have the linkable element (ie., the dimension) so it's erroring out during the cross join rendering since it should not have the spec from the filter.

I did some testing, it seems like this fixes that bug? https://github.com/dbt-labs/metricflow/pull/1381

Jstein77 commented 1 day ago

More details here:

https://github.com/dbt-labs/metricflow/issues/1199

Jstein77 commented 1 day ago

@jordan.stein no, we would not.

Jstein77 commented 1 day ago

Taking a look 👀

Jstein77 commented 1 day ago

I thought if we filter on the base measure set, then join in the conversion set we should only be matching users in the base measure set so wouldn't we be safe from false conversion matches?

Jstein77 commented 1 day ago

I can also repro if i run dbt sl query --metrics mql_to_seller_conversion_rate_base --where "{{Dimension('mql__origin')}} = 'direct_traffic'". In this case mql__origin is from the same semantic model as the base measure

Jstein77 commented 1 day ago

If you only apply to the filter to the base measure you may get inappropriate conversion matches.

Jstein77 commented 1 day ago

I feel like that was intentional but maybe I'm misremembering? @willymwai.deng would know

Jstein77 commented 1 day ago

I think we should apply the filter to the base measure, similar to how to do it for categorical dimensions. i.e The filter only gets applied here to limit the base measure set.

 (
    SELECT
      metric_time__day
      , SUM(mqls) AS mqls
    FROM (
      SELECT
        DATE_TRUNC('day', first_contact_date) AS metric_time__day
        , CASE WHEN mql_id IS NOT NULL THEN 1 ELSE 0 END AS mqls
      FROM ANALYTICS.dbt_jstein.olist_marketing_qualified_leads olist_mqls_src_10000
    ) subq_2
    WHERE metric_time__day > '2010-01-01'
    GROUP BY
      metric_time__day
  ) subq_4
Jstein77 commented 1 day ago

Yep that's what I would expect

Jstein77 commented 1 day ago

I get the error when filtering for metric time

Encountered an error error querying against the semantic layer: All join data sets should have the same set of linkable instances as the from dataset since all values are coalesced.
From dataset instance set: InstanceSet(measure_instances=(MeasureInstance(defined_from=(SemanticModelElementReference(semantic_model_name='olist_mqls', element_name='mqls'),), associated_columns=(ColumnAssociation(column_name='mqls', single_column_correlation_key=SingleColumnCorrelationKey(PYDANTIC_BUG_WORKAROUND=True)),), spec=MeasureSpec(element_name='mqls', non_additive_dimension_spec=None, fill_nulls_with=None), aggregation_state=AggregationState.COMPLETE),), dimension_instances=(), time_dimension_instances=(), entity_instances=(), group_by_metric_instances=(), metric_instances=(), metadata_instances=())
Join dataset instance sets: [InstanceSet(measure_instances=(MeasureInstance(defined_from=(SemanticModelElementReference(semantic_model_name='olist_closed_deals', element_name='sellers'),), associated_columns=(ColumnAssociation(column_name='sellers', single_column_correlation_key=SingleColumnCorrelationKey(PYDANTIC_BUG_WORKAROUND=True)),), spec=MeasureSpec(element_name='sellers', non_additive_dimension_spec=None, fill_nulls_with=None), aggregation_state=AggregationState.COMPLETE),), dimension_instances=(), time_dimension_instances=(TimeDimensionInstance(defined_from=(SemanticModelElementReference(semantic_model_name='olist_mqls', element_name='ds'),), associated_columns=(ColumnAssociation(column_name='metric_time__day', single_column_correlation_key=SingleColumnCorrelationKey(PYDANTIC_BUG_WORKAROUND=True)),), spec=TimeDimensionSpec(element_name='metric_time', entity_links=(), time_granularity=TimeGranularity.DAY, date_part=None, aggregation_state=None)),), entity_instances=(), group_by_metric_instances=(), metric_instances=(), metadata_instances=())]

Polling for queried result set
(cloud-env) jordanstein@Jordan-Stein jaffle-sl-template % dbt sl query --metrics mql_to_seller_conversion_rate_base --where "{{TimeDimension('metric_time')}} > '2010-01-01'"
Jstein77 commented 1 day ago

Actually, wait, I’m thinking of something else. It is a bit odd that it’s only time filters.

But the linked GH issue has a repro case in it, and that uses a metric_time filter as well.

Jstein77 commented 1 day ago

It should only happen with metric_time, for Reasons