apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.27k stars 1.19k forks source link

Fusing partial aggregation with repartition #12596

Open Rachelint opened 1 month ago

Rachelint commented 1 month ago

Is your feature request related to a problem or challenge?

I impl a poc https://github.com/apache/datafusion/pull/12526, and found this idea can actually improve performance.

But for some reasons stated in https://github.com/apache/datafusion/issues/11680#issuecomment-2368735093

I think this improvement is not so suitable to be pushed forward currently.

Just file an issue to track it.

Describe the solution you'd like

Describe alternatives you've considered

No response

Additional context

No response

Rachelint commented 1 month ago

take

Rachelint commented 1 month ago

I will push this forward after tasks listed in https://github.com/apache/datafusion/issues/11680#issuecomment-2368735093 finished

Rachelint commented 1 month ago

@waynexia may be also interested about this?

yjshen commented 1 month ago

Introduce the partitioned hashtable in partial aggregation, and we partition the datafusion before inserting them into hashtable. And we push them into final aggregation partition by partition after, rather than split them again in repartition, and merge them again in coalesce.

I'm not clear on how this proposal works. Could you please explain why it provides performance benefits compared to partial aggregation, exchange, and final aggregation? Is the proposal aimed explicitly at accelerating high cardinality aggregation, or is it intended to enhance aggregation performance?

Rachelint commented 1 month ago

Introduce the partitioned hashtable in partial aggregation, and we partition the datafusion before inserting them into hashtable. And we push them into final aggregation partition by partition after, rather than split them again in repartition, and merge them again in coalesce.

I'm not clear on how this proposal works. Could you please explain why it provides performance benefits compared to partial aggregation, exchange, and final aggregation? Is the proposal aimed explicitly at accelerating high cardinality aggregation, or is it intended to enhance aggregation performance?

I think it enhances aggregation performance generally?

After using partitioned approach in GroupValues and GroupAccumulator:

jayzhan211 commented 1 month ago

I think our goal is to combine partial + repartition + final into single operator, and partial + repartition fusing is the first step of this. After that we could try doing final aggr step as well.

Rachelint commented 1 month ago

I think our goal is to combine partial + repartition + final into single operator, and partial + repartition fusing is the first step of this. After that we could try doing final aggr step as well.

Yes, it may be attractive if we combine them by someway, we seem to have chance to do more optimizations. 🤔

It seems to be expensive for bytes and string? Maybe we can pass the internal states directly to final, and avoid some copies?