If I have two tables A and B and want to join them in some way we currently require loading both tables entirely into memory. If A or B are guaranteed to not change then we could optimize this to a half_join, where we load the non-changing table into memory and stream (avoid arranging) the other. Especially if the streamed data is large this can be more than just a 2x memory improvement. Consider the case of upserting into table, if we want to upsert a single row, today we need to load the entire destination table into memory whereas the half_join operator would allow us to load just the new row, and stream over the destination table.
At the moment what prevents us from doing this are:
A binary join is never planned with a delta join, which means half_join is never rendered
All delta joins must start from an arrangement, without fixing this we are forced to arrange A, even if B is already arranged
Not hydrating delta join paths (which already occurs) is not enough to prevent the arrangement of A, because we would still have to prepare their inputs.
These points are fairly large amounts of work, so we don't expect a single solution, but they serve to capture the current understanding of the problem.
Feature request
If I have two tables A and B and want to join them in some way we currently require loading both tables entirely into memory. If A or B are guaranteed to not change then we could optimize this to a half_join, where we load the non-changing table into memory and stream (avoid arranging) the other. Especially if the streamed data is large this can be more than just a 2x memory improvement. Consider the case of upserting into table, if we want to upsert a single row, today we need to load the entire destination table into memory whereas the half_join operator would allow us to load just the new row, and stream over the destination table.
At the moment what prevents us from doing this are:
These points are fairly large amounts of work, so we don't expect a single solution, but they serve to capture the current understanding of the problem.
cc @frankmcsherry