apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

[feature] Support Ballista dynamic shuffle partition number #813

Open Ted-Jiang opened 1 year ago

Ted-Jiang commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Ballista currently supports only a fixed shuffle partition number, configured through ballista.shuffle.partitions. Once this is set, the distributed physical plan always uses that fixed number of partitions.

Doc: link

Describe the solution you'd like

Restrictions: rows with the same values must still end up in the same partition.

  1. Deal with too many partitions:

targetPostShuffleInputSize (default 256MB): when several tasks would each read less than this size, we combine them into one single read task.
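The combining step could look roughly like the sketch below. It is only an illustration, not Ballista code: coalesce_partitions and the constant name are hypothetical, and the partition sizes are assumed to be known from shuffle write statistics. Adjacent partitions are merged whole and never split, so rows that shared a partition after the shuffle still share one read task, which respects the restriction above. A partition that is already larger than the target is left as its own task.

```rust
use std::ops::Range;

/// Default target from the issue text: 256 MB. The constant name is hypothetical.
const TARGET_POST_SHUFFLE_INPUT_SIZE: u64 = 256 * 1024 * 1024;

/// Hypothetical helper (not a Ballista API): groups adjacent shuffle
/// partitions so that each group's total size stays within `target`.
/// Returns half-open index ranges over `partition_sizes`, one per read task.
fn coalesce_partitions(partition_sizes: &[u64], target: u64) -> Vec<Range<usize>> {
    let mut groups = Vec::new();
    let mut start = 0;
    let mut current: u64 = 0;
    for (i, &size) in partition_sizes.iter().enumerate() {
        // Close the current group if adding this partition would exceed the
        // target. A group always keeps at least one partition, so an
        // oversized partition ends up alone instead of being split.
        if i > start && current.saturating_add(size) > target {
            groups.push(start..i);
            start = i;
            current = 0;
        }
        current = current.saturating_add(size);
    }
    // Flush the last open group, if any.
    if start < partition_sizes.len() {
        groups.push(start..partition_sizes.len());
    }
    groups
}

fn main() {
    const MB: u64 = 1024 * 1024;
    // Assumed per-partition shuffle output sizes, for illustration only.
    let sizes = [100 * MB, 100 * MB, 100 * MB, 300 * MB, 10 * MB, 20 * MB];
    for group in coalesce_partitions(&sizes, TARGET_POST_SHUFFLE_INPUT_SIZE) {
        println!("one read task covers partitions {:?}", group);
    }
}
```

With these assumed sizes, six shuffle partitions collapse into four read tasks: partitions 0 and 1 together, 2 alone, the oversized 3 alone, and 4 and 5 together.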

Describe alternatives you've considered

Additional context