TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.54k stars 183 forks source link

Replicate Cross Join Situation #371

Closed gregberns closed 2 years ago

gregberns commented 2 years ago

Just started playing with differential again, and trying to understand how to do similar things as Materialize - join relational data in potentially complex ways. I've got some existing queries that I'm using as a 'platform' to play with the lib - and the queries may not be 'good', so trying to understand how this might be done with "Dataflow Thinking".

This is a two part question, but it feels like there's a 'unified solution':

Query 1:

-- postgres sql
-- Context: This can return 10k rows for a single input value
CREATE FUNCTION filter_by_A1_id(param_A1_ID INT) 
  -- For simplicity, we just need to 'filter' `tableB1.id`'s by constraints on `tableA*`
  SELECT a1.id, b1.id
  FROM tableA1 a1
  JOIN tableA2 a2 ON a2.a1_id = a1.id -- this will be a 1:1 relationship

  CROSS JOIN tableB1 b1
  JOIN tableB2 b2 
    ON b2.b1_id = b1.id -- this will be a 1:1 relationship
      -- !! How to handle this?? Where B's rows are constrained by the value returned from A? 
      AND b2.a2_id = a2.id

  WHERE a1.id = param_A1_ID

Query 2:

-- Context: With the constraint on `tableA1` there can be 10k rows generated
SELECT filter_by_A1_id(a1.id)
FROM tableA1 a1
-- Some constraint to provide a subset
WHERE a1.some_date > NOW()

In 'normal' programming, it seems like the values from the A tables could be gathered in Query 1, then passed as parameters into the CROSS JOIN part of the query - and then the tableA* stuff can be removed.

How would this be approached in Differential?

I ran into scope.region which seems like it might be helpful - but haven't wrapped my head around it yet.

frankmcsherry commented 2 years ago

Materialize uses DD exclusively for its data processing, so tbh I'd recommend looking at the plans it generates. You can inspect these from the dataflow visualizer (the "hierarchical memory visualizer" at localhost:6875, or localhost:6876 depending on which version you have).

In your example, you would want to avoid the cross join (generally you want to avoid them) and instead follow the chain of equality constraints. Probably you would filter a1, equijoin with a2, equijoin with b2, then equijoin with b1. No cross join required. MZ will handle all of this reasoning for you, and you can even use EXPLAIN <sql query> for it to show you the join plan it would implement in DD.

Your second query is what Materialize calls a "temporal filter". It works out great there; again I'd check the query plans!

gregberns commented 2 years ago

Awesome, forgot I could flip the joins to get rid of the cross join - worked perfectly.

Thx for the tip to look at the Materialize Explain. After digging through this with DD and the complexity of the queries I need to replicate, sounds like I'm going to be playing with Materialize soon!