MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.71k stars 466 forks source link

Strongly normalizing IR #692

Open jamii opened 4 years ago

jamii commented 4 years ago

Many of our transforms are an attempt to reach some kind of normal form, with joins/filters/maps/projects etc fused and pushed down as far as they can go.

We could instead use a logical IR which doesn't specify the order of these operations:

enum LogicalRelationExpr {
  Predicate{
    inputs: Vec<LogicalRelationExpr>,
    maps: Vec<ScalarExpr>,
    filters: Vec<ScalarExpr>,
    project: Vec<usize>,
  }
  Reduce{...}
  Union{...}
  Branch{...}
}

Then in the transformation to the physical IR, we have one function that greedily takes inputs, maps, filters and projects from the logical IR.

frankmcsherry commented 4 years ago

I like this! It seems that a lot of fragments we end up with are aggregations around joins, and to the extent that we can put and leave things in this form, hooray!

In particular, the physical planning wants the flexibility to eagerly delete columns and apply predicates, even while reordering joins. Several of our “transformations” (the first commented batch) attempt to reach such a normal form (filters around a join blob), and it would be great to think about whether the others can retain this structure.

frankmcsherry commented 4 years ago

For example, chbench query 2 after a small (but surprising) amount of local clean-up, looks like:

Project {
  outputs: [5, 6, 31, 0, 2, 7, 9, 11, 31, 6, 0],
  Filter {
    predicates: [
      #0 = #12,
      #0 = #37,
      #8 = #30,
      #14 = #38,
      #29 = i32toi64 #5,
      #32 = #34,
      #4 ~ /^.*b$/,
      #35 ~ /^EUROP.*$/
    ],
    Join {
      variables: [],
      Get { item },
      Get { supplier },
      Get { stock },
      Get { nation },
      Get { region },
      Reduce {
        group_key: [0],
        aggregates: [min(#2)],
        Filter {
          predicates: [
            #17 = i32toi64 #18,
            #21 = #25,
            #27 = #29,
            #30 ~ /^EUROP.*$/
          ],
          Join {
            variables: [],
            Get { stock },
            Get { supplier },
            Get { nation },
            Get { region }
          }
        }
      }
    }
  }
}

This is with only fusion and elision applied, and a touch of inlining (for gets). It already has something of the appealing structure we are looking for.

However, it might be a bit too simplified already. We would love to keep the Let introduced by the branch operator, as it would help us reduce the complexity there (we are hoping to promote the outer relation out of the join it gets involved in). Unfortunately, without inlining Let the plan looks like:

    EDIT: Snipped

This query might not be the right example, sorry! It gets a sub-optimal plan that could re-use parts from the subquery, but it is actually expressed redundantly in SQL, so this isn't a self-own.

A better example might be query 4, which has an IN and which ends up as

Let {
  tmp_1 = Filter {
    predicates: [
      true,
      datetots #4 < 2012-01-02 00:00:00,
      datetots #4 >= 2007-01-02 00:00:00
    ],
    Get { "order" }
  }
} in
Let { tmp_2 = Distinct { group_key: [0 .. 2, 4], Get { tmp_1 } } } in
Let {
  tmp_3 = Join {
    variables: [],
    Distinct {
      group_key: [0 .. 3],
      Filter {
        predicates: [#0 = #4, #1 = #5, #2 = #6, #10 >= #3],
        Join { variables: [], Get { tmp_2 }, Get { orderline } }
      }
    },
    Constant [[true]]
  }
} in
Project {
  outputs: [0, 1, 0],
  Map {
    scalars: [],
    Reduce {
      group_key: [6],
      aggregates: [countall(null)],
      Filter {
        predicates: [#8],
        Project {
          outputs: [0 .. 7, 12],
          Join {
            variables: [
              [(0, 0), (1, 0)],
              [(0, 1), (1, 1)],
              [(0, 2), (1, 2)],
              [(0, 4), (1, 3)]
            ],
            Get { tmp_1 },
            Union {
              Get { tmp_3 },
              Join {
                variables: [],
                Union {
                  Negate {
                    Distinct { group_key: [0 .. 3], Get { tmp_3 } }
                  },
                  Get { tmp_2 }
                },
                Constant [[false]]
              }
            }
          }
        }
      }
    }
  }
}

In this case, we want to figure out how to migrate tmp_2 upwards to bump into the join with tmp_1 and discard it (the general pattern is X joined with some reduction that contains a join with distinct of X, and we want to lift the distinct of X up to the join with X so it can be cancelled).

This representation actually looks tractable for that (if we can dance around the unions and such).

benesch commented 3 years ago

I believe we got this with the map-filter-project operator (#6339).

frankmcsherry commented 3 years ago

@asenac may want to revive this, as he is specifically interested in a multilinear operator as proposed here. That would be something we do not yet have.

asenac commented 3 years ago

Yes! This is basically the idea behind the query graph model. Normalization/simplification is a lot easier/simpler with a reduced set of logical operations. I think we should revive this and do it after the LIR is pushed. Then, we can fuse these operators into one and move join planning to the code that does the MIR -> LIR transformation.