apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.14k stars 1.16k forks source link

CrossJoin Implementation on (M x N) Partitions #9941

Open berkaysynnada opened 6 months ago

berkaysynnada commented 6 months ago

Is your feature request related to a problem or challenge?

There is a TODO item in CrossJoin: https://github.com/apache/arrow-datafusion/blob/2f550032140d42d1ee6d8ed86f7790766fa7302e/datafusion/physical-plan/src/joins/cross_join.rs#L122

Currently CrossJoin partition count is the partition count of the right child. We can increase parallelism here if allowed.

Describe the solution you'd like

Let's say left has M and right has N partitions, and the target partition is T. We can increase the parallelism by getting the left partitions count to floor[T/N] (assuming T is not smaller than N). If (M x N) is smaller or equal than T, there would be no need to coalesce left partitions also.

Describe alternatives you've considered

-

Additional context

Theoretically, for example, 1x8 partitions of joins does the same amount of unit work with 2x4, but in practice, 2x4 parallelism may be more preferable (I have no solid evidence). So, without changing the target partitions, such kind of parallelism adjustment also be done if it is proved that it works better.

Dandandan commented 18 hours ago

How would this work? At this moment the entire left side is loaded in memory, basically performing a https://en.m.wikipedia.org/wiki/Block_nested_loop with all data loaded into memory. The left child plan can be executed in parallel. Not loading all of the left side seems to require scanning one side more than once?