dbt-labs / metricflow

MetricFlow allows you to define, build, and maintain metrics in code.
https://docs.getdbt.com/docs/build/about-metricflow

[SL-1628] Update dataflow plan to push predicates down to semantic model source queries where possible #1011

Closed · tlento closed 4 days ago

tlento commented 5 months ago

Given a list of predicate filters, push each one down to its source semantic model if it is safe to do so.

At this time, we will support pushdown for predicates that are strictly limited to metric_time OR that have all linkable specs (dimensions, entities, etc. in the expression) originating from the same semantic model.

For example (raw names used for readability):

  1. Ineligible: booking__is_instant OR listing__is_lux
  2. Ineligible: booking__is_instant AND (listing__is_lux OR listing__is_active)
  3. Equivalent to 2, but both eligible for pushdown: ["booking__is_instant", "listing__is_lux OR listing__is_active"]
  4. Eligible if the only semantic model involved is bookings; as in 2, splitting this into two elements in a list would make it eligible for pushdown in all cases: booking__is_instant OR booking AND metric_time >= '2021-01-01' AND metric_time <= '2021-01-31'
  5. Eligible: (metric_time >= '2021-01-01' AND metric_time <= '2021-01-31') OR metric_time BETWEEN '2022-01-01' AND '2022-01-31'

The key is that the inputs referenced in the predicates must all map onto elements in the same underlying semantic model. This may need to be restricted to dimensions only at first, since entities can be more complicated to evaluate without ambiguity, but ideally things will align in a fairly straightforward manner.
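
As a rough illustration of the eligibility rule, here is a minimal Python sketch. The function name and the spec-to-model mapping are hypothetical stand-ins for MetricFlow's actual resolution machinery, not its API:

```python
METRIC_TIME = "metric_time"

def eligible_for_pushdown(
    referenced_specs: set[str],
    spec_to_semantic_model: dict[str, str],
) -> bool:
    """Return True if a predicate is safe to push down to a source query.

    A predicate qualifies if it references only metric_time, or if every
    linkable spec it references resolves to the same semantic model
    (metric_time may appear alongside them, as in example 4 above).
    """
    non_time_specs = {s for s in referenced_specs if s != METRIC_TIME}
    if not non_time_specs:
        return True  # pure metric_time predicate, as in example 5
    source_models = {spec_to_semantic_model.get(s) for s in non_time_specs}
    # Unresolvable specs (None) or specs spanning models disqualify it.
    return None not in source_models and len(source_models) == 1

spec_map = {"booking__is_instant": "bookings", "listing__is_lux": "listings"}
# Example 1 mixes two semantic models in one predicate, so it is ineligible:
assert not eligible_for_pushdown({"booking__is_instant", "listing__is_lux"}, spec_map)
# Example 3 splits it into list elements, each of which is eligible on its own:
assert eligible_for_pushdown({"booking__is_instant"}, spec_map)
assert eligible_for_pushdown({"listing__is_lux"}, spec_map)
```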

Some notes on implementation:

  1. The best current path appears to be to use the group by element resolver to map predicates to semantic models, and push down any predicate that maps onto exactly one semantic model (with the appropriate handling for metric_time). The pushdown can directly target the SourceNode construction, so effectively the query will request a node set that never sees the filtered rows.
  2. We currently do a join -> filter -> aggregate process. Pushdown will change this to pushdown filter -> join -> other filters -> aggregate. The issue here is that the results may differ for outer join types, because the join step can produce NULL outputs in filter columns, particularly those that have been pushed down. So someone may request results with a filter of listing IS NOT NULL and we'll happily give them NULL listing outputs in their group by - but not all of them (see the simulation after this list). For this reason, we should re-apply the pushed-down filters post-join in the initial implementation, and look to optimize out any pure redundancies in a follow-up.
  3. At this time we should not attempt to push down time dimension predicates if metrics involving time windows are involved (cumulative, derived offset, or conversion metrics), as this could inappropriately truncate input.
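
To make note 2 concrete, here is a small, simplified Python simulation of the outer-join problem, with plain dicts standing in for dataflow rows (none of this is MetricFlow code). A user filter of listing IS NOT NULL, pushed down to the listings source where it is trivially true, still lets NULL listing values leak through a left join:

```python
bookings = [
    {"booking_id": 1, "listing": "a"},
    {"booking_id": 2, "listing": None},  # a booking with no listing
]
listings = [
    {"listing": "a", "is_lux": True},
]

def apply_filter(rows, pred):
    return [r for r in rows if pred(r)]

def left_join(left, right, key):
    joined = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            joined.extend({**l, **m} for m in matches)
        else:
            # Unmatched left rows keep NULL (None) in right-side columns.
            joined.append({**l, "is_lux": None})
    return joined

# Pushdown applies "listing IS NOT NULL" only at the listings source, where
# it filters nothing, so the joined output can still contain NULL listings.
filtered_listings = apply_filter(listings, lambda r: r["listing"] is not None)
rows = left_join(bookings, filtered_listings, key="listing")
print(rows)
# [{'booking_id': 1, 'listing': 'a', 'is_lux': True},
#  {'booking_id': 2, 'listing': None, 'is_lux': None}]  <- NULL leaked through
# Re-applying the pushed-down filter post-join removes the leaked row:
print(apply_filter(rows, lambda r: r["listing"] is not None))
```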


Jstein77 commented 3 months ago

in progress

Jstein77 commented 2 months ago

@tomkit.lento is this one still in progress?

Jstein77 commented 1 month ago

Predicate pushdown update - tl;dr: a one-week spike on a post-plan-building optimizer approach, followed by a decision on whether to take another week to finish up the optimizer-based solution or to cut it off and move on to other priorities. The problem here is that categorical dimension pushdown does some seriously stupid shit with simple metric queries. Example SQL post-pushdown:

select sum(bookings) as instant_bookings
from (
  select bookings, bookings__is_instant
  from (
    select 1 as bookings, is_instant as bookings__is_instant
    from bookings
  ) a
  where bookings__is_instant -- pushed-down copy of the filter
) b
where bookings__is_instant -- original filter, now redundant

There's an easy hack we can put in place to disable it for the most obvious scenarios (mf query --metrics instant_bookings): skip pushdown for queries sourced from a single semantic model (a rough sketch of this guard follows the list below). That won't cover slightly more complex but still fairly obvious cases (mf query --metrics instant_bookings,listings). A complete solution, which involves moving the predicates instead of replicating them, can be done via a DataflowPlanOptimizer. Doing this requires the following:

  1. Robust test coverage of existing join+filter scenarios (in progress, must happen no matter what)
  2. Consolidation of all pre-existing time-related pushdown operations (90% finished)
  3. A new optimizer that replaces our pushdown operations with a centralized handler (long pole)
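
As a rough sketch of the single-semantic-model hack mentioned above (the function name is hypothetical; the real check would live wherever the pushdown operation inspects its source nodes):

```python
def should_attempt_pushdown(source_semantic_models: set[str]) -> bool:
    """Skip pushdown when a query reads from a single semantic model.

    With one source model the filter already lands next to the data, so
    pushing it down only duplicates the WHERE clause, as in the SQL above.
    """
    return len(source_semantic_models) > 1
```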

After talking to @Jordan we decided to do the following:

  1. I will spike on the optimizer approach with a tentative plan to present a prototype at Tuesday's MetricFlow team meeting. If it's still in progress I'll present what I've got and push to Thursday standups
  2. After Thursday standups we will make a call, based on this experience, about whether to finish up the optimizer now, or to add the single-semantic-model hack to the existing pushdown operation, roll it out, and put the complete solution on our backlog in favor of focusing on custom calendar work

We've decided to spike on this because:

  1. The optimizer approach will handle pushdown correctly - no more weird duplicate where filters, etc.
  2. The pushdown consolidation required for the optimizer to work will encapsulate our filter granularity adjustment logic, which will help speed custom calendar development as well. Note this consolidation will happen whether or not we are able to do the optimizer, so part of this week is effectively custom calendar pre-work.
  3. The optimizer approach will make it easier to expand our pushdown operations to more input types (entities, time dimensions) and, when we get around to enabling more robust filter expressions, will make it easier to be more aggressive about which types of filters we push down

If you have questions or concerns, fire away!

tlento commented 3 weeks ago

Follow-ups still pending:

  1. Support entity filters. This has to account for the fact that entities may be defined in multiple semantic models, and different path designations mean different things in terms of whether and how the filters should be applied.
  2. Support metric_time filters. These will need to be handled separately - we should not try to reason about metric_time filters alongside categorical filters, which means case 4 above will essentially no longer be eligible for pushdown. The reason is that it's risky to propagate metric_time filters to the dimension side of a join. While such joins are rare, they do exist - think of a unique_users metric keyed on registration_date and defined on a count measure against a dim_users dataset. We won't want to filter out users based on their registration_dates if someone requests a visits metric by user__country (see the sketch after this list).
  3. Expand availability of the list-of-filters interface so people can actually control predicate pushdown at query time.
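
To illustrate follow-up 2, here is a hedged sketch of the intended separation. All names are hypothetical; this describes the rule, not MetricFlow's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    expression: str
    references_metric_time: bool
    references_other_specs: bool

def may_push_metric_time_filter(pred: Predicate, is_measure_side_input: bool) -> bool:
    """Decide whether a metric_time predicate may reach a source node."""
    if not pred.references_metric_time or pred.references_other_specs:
        # metric_time is reasoned about on its own; mixed predicates like
        # case 4 in the issue description are no longer eligible.
        return False
    # Never filter a dimension-side input (e.g. dim_users) by metric_time:
    # that could drop users registered outside the queried window even
    # though their visits fall inside it.
    return is_measure_side_input
```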