MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com
Other
5.68k stars 458 forks source link

compute: Expose a pattern that uses or mimics differential's half_join operator #25945

Open ParkMyCar opened 4 months ago

ParkMyCar commented 4 months ago

Feature request

If I have two tables A and B and want to join them in some way we currently require loading both tables entirely into memory. If A or B are guaranteed to not change then we could optimize this to a half_join, where we load the non-changing table into memory and stream (avoid arranging) the other. Especially if the streamed data is large this can be more than just a 2x memory improvement. Consider the case of upserting into table, if we want to upsert a single row, today we need to load the entire destination table into memory whereas the half_join operator would allow us to load just the new row, and stream over the destination table.

At the moment what prevents us from doing this are:

  1. A binary join is never planned with a delta join, which means half_join is never rendered
  2. All delta joins must start from an arrangement, without fixing this we are forced to arrange A, even if B is already arranged
  3. Not hydrating delta join paths (which already occurs) is not enough to prevent the arrangement of A, because we would still have to prepare their inputs.

These points are fairly large amounts of work, so we don't expect a single solution, but they serve to capture the current understanding of the problem.

cc @frankmcsherry

ggevay commented 4 months ago

Connecting some related Slack threads:

teskje commented 4 months ago

This is a more specialized (and easier to implement) version of #25989.