Handling the cartesian joins produced by Calcite

wmalpica commented 4 years ago

Sometimes Calcite produces a relational algebra plan that produces a cartesian join. For example: query = """ select o_orderpriority, count() as order_count from orders where o_orderdate >= date '1993-07-01' and o_orderdate < date '1994-10-01' and exists (select from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate) group by o_orderpriority order by o_orderpriority"""

Produces the following relational algebra plan:

LogicalSort(sort0=[$0], dir0=[ASC]) LogicalAggregate(group=[{0}], order_count=[COUNT()]) LogicalProject(o_orderpriority=[$1]) LogicalFilter(condition=[AND(>=($0, 1993-07-01), <($0, 1993-10-01), IS NOT NULL($2))]) LogicalJoin(condition=[true], joinType=[left]) BindableTableScan(table=[[main, orders]], projects=[[4, 5]]) LogicalProject($f0=[$0]) LogicalAggregate(group=[{}], agg#0=[MIN($0)]) LogicalProject($f0=[true]) LogicalJoin(condition=[=($0, $1)], joinType=[inner]) LogicalProject(l_orderkey=[$0]) LogicalFilter(condition=[<($11, $12)]) LogicalTableScan(table=[[main, lineitem]]) BindableTableScan(table=[[main, orders]], projects=[[0]])

We need to figure out a way to handle this. Either by implementing a full cartesian join, or in the case of this query, one of the two sides of the query has only a single row, in which case we could handle it in some sort of special way for these very special cases

wmalpica commented 4 years ago

Cross joins are now available in cudf. We should just use that: https://github.com/rapidsai/cudf/pull/5327

wmalpica commented 4 years ago

We will still need to figure out how to do this in a distributed context. We are hoping and assuming that the times that we have to implement a cartesian join, that its actually small, otherwise it will be horrible. But if we do assume that the cartesian join will not be too big, we can just scatter one of the two tables. To do this we can follow a similar pattern as for regular inner joins when there is a small table join optimization, where we want to :

Check the size of each table
scatter the smallest of the two
perform the join

Note that the join itself can follow the same PartwiseJoin style

BlazingDB / blazingsql

Handling the cartesian joins produced by Calcite #702