apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.41k stars 1.27k forks source link

[multistage] ShuffleRewriteVisitor Can Allow Shuffle to be Skipped if Data is on Different Servers #9748

Open ankitsultana opened 1 year ago

ankitsultana commented 1 year ago

During the shuffle rewrite phase, at present we only look at the partitioning keys to determine whether we can skip shuffle across two stages. Reference: https://github.com/apache/pinot/blob/master/pinot-query-planner/src/main/java/org/apache/pinot/query/planner/logical/ShuffleRewriteVisitor.java#L185

However, it may be even though the partitioning keys are same, the data is actually on different servers. Things are working fine right now since we don't have partitioning keys in TableScan node.

Once we add partitioning keys in TableScan node, we can easily run into this issue if the two tables involved in a join are on different servers but their partitioning keys and join key are the same.

cc: @walterddr

walterddr commented 1 year ago

+1. the only reason why this util is so simple rely on 2 pre-conditions.

  1. all leaf stages are assumed unpartitioned
  2. all intermediate stages assigned with the same set of server (in fact right now ALL server within the DEFAULT tenant)

any of these 2 assumption break could cause skipShuffle issue.

CC @agavra

61yao commented 1 year ago

Can these two assumptions be checked somewhere in the code and we return an error for this case for now? If not, maybe we should just disable skipShuffle for now? It is better to be slow than to be wrong "sometimes"...