The result of a relational join or semi-join can have the same size as the cartesian product of the two inputs, in the worst case. Thus, it is an upper bound for the result size. Very often, the actual result size will be much smaller, while the upper bound is too large to be allocated. In fact, many joins encountered in practice are N:1 join (e.g., primary-key foreign-key joins). In such cases, one input of the (semi-)join has unique keys. Then, the result size is upper-bounded by the other input.
Currently, DAPHNE's innerJoin-kernels always allocates the size of the cartesian product (src/local/runtime/kernels/innerJoin.h, line 92: const size_t totalRows = numRowRhs * numRowLhs;), and the semiJoin-kernel always allocates the size of the left-hand-side input (src/runtime/local/kernels/semiJoin.h, line 75: res = DataObjectFactory::create<Frame>(numArgLhs, 1, schema, nullptr, false); and 79). Both should be improved. We need a way to prevent DAPHNE from allocating the size of the cartesian product for join results for N:1 joins.
As a quick fix, we want to add an optional parameter to both join variants that allows users to specify the number of result rows to allocate. In DaphneDSL, when A and B are frames, it should be possible to write:
f1 = innerJoin(A, B, "a.fk", "b.pk"); # allocates the size of AxB for the result
f2 = innerJoin(A, B, "a.fk", "b.pk", nrows(A)); # allocates the size of A for the result
f3 = semiJoin(A, B, "a.fk", "b.pk"); # allocates the size of AxB for the result
f4 = semiJoin(A, B, "a.fk", "b.pk", nrows(A)); # allocates the size of A for the result
Hints:
Add an optional parameter to the DaphneDSL built-in functions in src/parser/daphnedsl/DaphneDSLBuiltins.cpp. If the parameter is not provided in DaphneDSL, a -1 should be passed to the DaphneIR operations.
Add an additional mandatory argument to the InnerJoinOp and SemiJoinOp in DaphneIR.
Adapt the innerJoin and semiJoin-kernels to use this additional argument; if it is -1 allocate the cartesian product, otherwise use the specified size for the result.
Add script-level test cases.
We should go for a quick fix now since need this to work soon, but a better long-term solution could be:
Don't allocate the result of joins (and other operations, e.g., filters) upfront, but create it chunk-by-chunk.
Automatically detect N:1 joins in the DAPHNE compiler and/or runtime kernels. This could be achieved through a "unique" data property.
The result of a relational join or semi-join can have the same size as the cartesian product of the two inputs, in the worst case. Thus, it is an upper bound for the result size. Very often, the actual result size will be much smaller, while the upper bound is too large to be allocated. In fact, many joins encountered in practice are N:1 join (e.g., primary-key foreign-key joins). In such cases, one input of the (semi-)join has unique keys. Then, the result size is upper-bounded by the other input.
Currently, DAPHNE's
innerJoin
-kernels always allocates the size of the cartesian product (src/local/runtime/kernels/innerJoin.h
, line 92:const size_t totalRows = numRowRhs * numRowLhs;
), and thesemiJoin
-kernel always allocates the size of the left-hand-side input (src/runtime/local/kernels/semiJoin.h
, line 75:res = DataObjectFactory::create<Frame>(numArgLhs, 1, schema, nullptr, false);
and 79). Both should be improved. We need a way to prevent DAPHNE from allocating the size of the cartesian product for join results for N:1 joins.As a quick fix, we want to add an optional parameter to both join variants that allows users to specify the number of result rows to allocate. In DaphneDSL, when
A
andB
are frames, it should be possible to write:Hints:
src/parser/daphnedsl/DaphneDSLBuiltins.cpp
. If the parameter is not provided in DaphneDSL, a-1
should be passed to the DaphneIR operations.InnerJoinOp
andSemiJoinOp
in DaphneIR.innerJoin
andsemiJoin
-kernels to use this additional argument; if it is-1
allocate the cartesian product, otherwise use the specified size for the result.We should go for a quick fix now since need this to work soon, but a better long-term solution could be: