Closed palaska closed 1 month ago
In datafusion, target_partition argument doesn't necessarily increase partition count each time. If DataFusion thinks that executing the query in single partition is better in terms of performance, it will do so even if target_partition number is larger than 1. Do you think, parallelism will improve the performance for this query, if you think so we should definitely increase partition for this query. What is your thoughts in this regard?
In short, setting target_partitions to larger than 1 doesn't necessarily increase partition in datafusion.
In datafusion, target_partition argument doesn't necessarily increase partition count each time. If DataFusion thinks that executing the query in single partition is better in terms of performance, it will do so even if target_partition number is larger than 1. Do you think, parallelism will improve the performance for this query, if you think so we should definitely increase partition for this query. What is your thoughts in this regard?
In short, setting target_partitions to larger than 1 doesn't necessarily increase partition in datafusion.
Thanks for the explanation! I agree that optimizing for performance makes sense, as long as it doesn't compromise guarantees or hurt system predictability. In Ballista, a "task" is generated for each partition, and changing this behavior has caused some unit test assertions to fail. However, I don't think this is a major issue for Ballista. I’m not familiar with how this flag is being used in other systems, but @alamb might have some insights to share.
I think the reason it is called "target_partitions" is that is is not a guarantee but instead is a target used for performance optimizations as @akurmustafa mentions
If you need more than one partition you can always modify the plan / set the required input partitions
Thanks for the clarifications, closing this one.
Describe the bug
target_partitions execution option is ignored when the input has 1 partition. The introduced condition here is the root cause.
To Reproduce
Above code produces a physical plan with a single output partition and the plan doesn't contain a RepartitionExec.
When the input partition number is set to a value > 1, it works as expected. (
multi_partitions
becomestrue
here)Expected behavior
No response
Additional context
No response