apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.5k stars 1.29k forks source link

[multistage] consider not relying on column ordering for partitioning #9998

Open agavra opened 1 year ago

agavra commented 1 year ago

Background: imagine we have the two following table schemas:

(A) colA INT, colB STRING
(B) col1 STRING, col2 INT

and I want to issue SELECT * FROM A JOIN B ON A.colA = B.col2 AND A.colB = B.col1. In this case, we will need to hash shuffle the rows to match partitions. In Calcite, we generate a hash distribution with the same join keys:

leftExchange = LogicalExchange.create(leftInput, RelDistributions.hash(joinInfo.leftKeys));
rightExchange = LogicalExchange.create(rightInput, RelDistributions.hash(joinInfo.rightKeys));

When we do this, Calcite will order the join keys passed into RelDistributions.hash to be in ascending order, even if we passed them in with a particular ordering. In the example above, we would have called:

RelDistributions.hash(1, 2); // for A, because join keys are in order
RelDistributsions.hash(2, 1); // for B, because we want col2 to be the first join key

But calcite reorders them to both be [1, 2].

This causes a problem when we hash our keys. Imagine we had the following rows:

(a) colA: 1, colB: "foo"
(b) col1: "foo", col2: 1

If we simply use the column ordering that calcite provides for the hash exchange, we may not hash these two rows into the same partition (hash(1, "foo") and hash("foo", 1) produce different results).

9996 provides a workaround for the solution by using a hash code algorithm that intentionally generates collisions independent of the ordering of the columns (using hash code addition).

This is likely an acceptable solution in the long run so long as the distribution of the hash codes is semi-random even after addition (some initial experimentations show that it is). A better solution, however, would be to fix calcite to maintain the join key ordering so that we can just hash the exact keys.

walterddr commented 1 year ago

we might be able to create our own RelDistributionImpl class and solve it that way.

There seems to be some development that can work around this (for example https://github.com/apache/pinot/pull/11630/ only deal with single column key)

I will follow up and see if there's better options