Ballista: Fix hacks around concurrency=2 to force hash-partitioned joins

andygrove commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

By default, DataFusion uses hash-partitioned joins if concurrency > 1 which led to me adding this hacky code in a couple of places in Ballista.

let config = ExecutionConfig::new().with_concurrency(2); // TODO: this is hack to enable partitioned joins
let mut ctx = ExecutionContext::with_config(config);

Describe the solution you'd like I'm actually not sure what the solution should be, but I would like to be able to tell the context to use hash-partitioned joins, separately from specifying concurrency.

Describe alternatives you've considered None

Additional context This code is running in the scheduler, not in the executor where the query actually executes. The scheduler concurrency should not impact how the query is planned.

Dandandan commented 3 years ago

I am wondering if the current 2 should just be based on a configation setting for the default number of partitions (just like Spark uses 200 partitions as a default). Maybe we should clean up the terminology wrt concurrency and number of partitions a bit in that case

andygrove commented 3 years ago

:100: That sounds great to me.

apache / datafusion-ballista

Ballista: Fix hacks around concurrency=2 to force hash-partitioned joins #20