daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Join and semi-join with result cardinality hint #901

Open pdamme opened 1 week ago

pdamme commented 1 week ago

In the worst case, the result of a relational join or semi-join can have the same size as the cartesian product of the two inputs. Thus, the size of the cartesian product is an upper bound for the result size. Very often, the actual result is much smaller, while the upper bound is too large to allocate. In fact, many joins encountered in practice are N:1 joins (e.g., primary-key foreign-key joins). In such cases, one input of the (semi-)join has unique keys, so each row of the other input finds at most one match. The result size is then upper-bounded by the number of rows of the other input.
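To illustrate the N:1 bound, here is a minimal, self-contained sketch of a hash join over plain key vectors. It is not DAPHNE code: the function name hashJoinN1 and the use of std::vector instead of DAPHNE frames are assumptions made for illustration only. Since the keys on the build side are assumed to be unique, every probe row matches at most once, so reserving as many result rows as there are probe rows is sufficient.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of DAPHNE): inner join of the foreign keys
// `fks` against the primary keys `pks`. Returns pairs of
// (row index in fks, row index in pks).
std::vector<std::pair<std::size_t, std::size_t>>
hashJoinN1(const std::vector<std::int64_t> &fks,
           const std::vector<std::int64_t> &pks) {
    // Build side: keys are assumed to be unique, so each key maps to one row.
    std::unordered_map<std::int64_t, std::size_t> pkToRow;
    pkToRow.reserve(pks.size());
    for (std::size_t r = 0; r < pks.size(); ++r)
        pkToRow.emplace(pks[r], r);

    // Probe side: each row matches at most once, so fks.size() rows suffice
    // and there is no need to reserve fks.size() * pks.size() rows.
    std::vector<std::pair<std::size_t, std::size_t>> res;
    res.reserve(fks.size());
    for (std::size_t l = 0; l < fks.size(); ++l) {
        auto it = pkToRow.find(fks[l]);
        if (it != pkToRow.end())
            res.emplace_back(l, it->second);
    }

    // Invariant of an N:1 join: never more result rows than probe rows.
    assert(res.size() <= fks.size());
    return res;
}
```

For example, joining five foreign keys against four unique primary keys yields at most five result rows, whereas the cartesian product would have twenty.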

Currently, DAPHNE's innerJoin-kernel always allocates the size of the cartesian product (src/runtime/local/kernels/innerJoin.h, line 92: const size_t totalRows = numRowRhs * numRowLhs;), and the semiJoin-kernel always allocates the size of the left-hand-side input (src/runtime/local/kernels/semiJoin.h, line 75: res = DataObjectFactory::create<Frame>(numArgLhs, 1, schema, nullptr, false); and line 79). Both should be improved. In particular, we need a way to prevent DAPHNE from allocating the size of the cartesian product for the results of N:1 joins.

As a quick fix, we want to add an optional parameter to both join variants that allows users to specify the number of result rows to allocate. In DaphneDSL, when A and B are frames, it should be possible to write:

f1 = innerJoin(A, B, "a.fk", "b.pk");           # allocates the size of AxB for the result
f2 = innerJoin(A, B, "a.fk", "b.pk", nrows(A)); # allocates the size of A for the result
f3 = semiJoin(A, B, "a.fk", "b.pk");            # allocates the size of A for the result (current behavior)
f4 = semiJoin(A, B, "a.fk", "b.pk", nrows(A));  # allocates the size of A for the result
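
On the kernel side, the optional hint for innerJoin could be handled roughly as in the following sketch. This is only an assumption about how such a parameter might look, not the actual DAPHNE implementation: the function name innerJoinRowsToAllocate, the parameter name numRowRes, and the convention that -1 means "no hint given" are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of DAPHNE): decides how many result rows an
// inner join should allocate. `numRowRes` is the optional user hint; the
// value -1 is an assumed sentinel for "no hint given".
std::size_t innerJoinRowsToAllocate(std::size_t numRowLhs,
                                    std::size_t numRowRhs,
                                    std::int64_t numRowRes = -1) {
    // Only -1 is used as the sentinel; other negative values are misuse.
    assert(numRowRes >= -1);

    // With a hint, allocate exactly what the user asked for.
    if (numRowRes >= 0)
        return static_cast<std::size_t>(numRowRes);

    // Without a hint, fall back to the size of the cartesian product,
    // guarding against overflow of the multiplication.
    if (numRowLhs != 0 &&
        numRowRhs > std::numeric_limits<std::size_t>::max() / numRowLhs)
        throw std::overflow_error("cartesian product size overflows size_t");
    return numRowLhs * numRowRhs;
}
```

With 1000 rows in A and 10 rows in B, this sketch allocates 10000 rows without a hint and 1000 rows when the hint nrows(A) is passed. A real kernel would additionally need to decide what happens if the actual result exceeds the hinted size; that question is left open here.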

Hints:


We should go for a quick fix now since we need this to work soon, but a better long-term solution could be:

saminbassiri commented 1 week ago

Hi, I will work on this issue.

pdamme commented 1 week ago

Thanks, please go ahead!