apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Remove redundant RepartitionExec from plan #384

Open andygrove opened 3 years ago

andygrove commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I was running benchmarks this morning and noticed this fragment of the query plan when running TPC-H query 5.

CoalesceBatchesExec: target_batch_size=4096
  RepartitionExec: partitioning=Hash([Column { name: "n_nationkey" }], 24)
    RepartitionExec: partitioning=RoundRobinBatch(24)
      ParquetExec: batch_size=8192,

It seems redundant to repartition with round-robin and then immediately repartition again using a hash.

A more complex example:

RepartitionExec: partitioning=Hash([Column { name: "r_regionkey" }], 24)
  CoalesceBatchesExec: target_batch_size=4096
    FilterExec: r_name = ASIA
      RepartitionExec: partitioning=RoundRobinBatch(24)
        ParquetExec: batch_size=8192,

In this example, we could push the repartition on r_regionkey down to the scan.

Describe the solution you'd like

Implement a new optimizer rule to remove redundant repartitions and/or push down repartitions.
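As a rough illustration of the "remove redundant repartitions" half of this idea, here is a minimal sketch over a toy plan tree. The `Plan` and `Partitioning` types and the `remove_redundant_repartition` function are hypothetical names made up for this example; they are not DataFusion's actual `ExecutionPlan` or optimizer APIs. The sketch assumes that a round-robin repartition sitting directly under a hash repartition can be dropped without changing the output partitioning, and it ignores any parallelism the round-robin step might provide.

```rust
// Toy model of a physical plan. These types are illustrative only and
// are NOT DataFusion's real API.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    RoundRobinBatch(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    ParquetScan { batch_size: usize },
    Repartition { partitioning: Partitioning, input: Box<Plan> },
    CoalesceBatches { target_batch_size: usize, input: Box<Plan> },
}

/// Rewrites the plan bottom-up. Whenever a hash repartition sits directly
/// on top of a round-robin repartition, the round-robin node is dropped.
fn remove_redundant_repartition(plan: Plan) -> Plan {
    match plan {
        // Leaf node: nothing to rewrite.
        scan @ Plan::ParquetScan { .. } => scan,
        Plan::CoalesceBatches { target_batch_size, input } => Plan::CoalesceBatches {
            target_batch_size,
            input: Box::new(remove_redundant_repartition(*input)),
        },
        Plan::Repartition { partitioning, input } => {
            // Optimize the child first, then inspect the result.
            let input = remove_redundant_repartition(*input);
            let parent_is_hash = matches!(partitioning, Partitioning::Hash(..));
            match input {
                // Hash directly over round-robin: skip the round-robin node.
                Plan::Repartition {
                    partitioning: Partitioning::RoundRobinBatch(_),
                    input: inner,
                } if parent_is_hash => Plan::Repartition { partitioning, input: inner },
                // Any other child is kept as-is.
                other => Plan::Repartition { partitioning, input: Box::new(other) },
            }
        }
    }
}
```

Applied to a toy version of the first plan fragment above, this would leave only the hash repartition between `CoalesceBatches` and the scan. It does not handle the second fragment, where a filter sits between the two repartitions; that case needs a push-down step rather than a simple removal.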

Describe alternatives you've considered

None

Additional context

None

Dandandan commented 3 years ago

I am wondering whether this would help DataFusion: the earlier round-robin repartition adds some parallelism, and doing a hash repartition before the filter would be more expensive than doing it after the filter.

For Ballista this might be different if the data needs to be shuffled to multiple machines.
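To make the cost argument concrete, here is a small sketch that counts how many rows get hashed depending on whether the hash repartition runs before or after the `r_name = ASIA` filter. The `Row` struct and the `region_rows`, `hash_partition`, and `count_hashed_rows` helpers are hypothetical and exist only for this example. The data is the five-row TPC-H `region` table, and the hashing uses the Rust standard library rather than whatever DataFusion uses internally.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical row type for the TPC-H `region` table (illustration only).
struct Row {
    r_regionkey: u64,
    r_name: &'static str,
}

/// The five rows of the TPC-H `region` table.
fn region_rows() -> Vec<Row> {
    vec![
        Row { r_regionkey: 0, r_name: "AFRICA" },
        Row { r_regionkey: 1, r_name: "AMERICA" },
        Row { r_regionkey: 2, r_name: "ASIA" },
        Row { r_regionkey: 3, r_name: "EUROPE" },
        Row { r_regionkey: 4, r_name: "MIDDLE EAST" },
    ]
}

/// Assigns a key to one of `partitions` output partitions by hashing it.
fn hash_partition(key: u64, partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % partitions
}

/// Counts how many rows are hashed. With `filter_first = true` the filter
/// runs before the hash repartition; otherwise every input row is hashed.
fn count_hashed_rows(rows: &[Row], partitions: u64, filter_first: bool) -> usize {
    let mut hashed = 0;
    for row in rows {
        if filter_first && row.r_name != "ASIA" {
            continue; // Row is discarded before it reaches the repartition.
        }
        let _partition = hash_partition(row.r_regionkey, partitions);
        hashed += 1;
    }
    hashed
}
```

With 24 partitions, pushing the hash repartition below the filter hashes all five rows, while keeping it above the filter hashes only the single ASIA row. On a table this small the difference is negligible, but the ratio scales with the filter's selectivity on larger inputs.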