apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.02k stars 1.14k forks source link

Improve aggregation code readability #12335

Open lewiszlw opened 3 weeks ago

lewiszlw commented 3 weeks ago

Is your feature request related to a problem or challenge?

I'm reading aggregation code recently, found it's hard to read due to different AggregateMode, these modes are mixed up in the code.

Describe the solution you'd like

Might split stream based on AggregateMode?

Describe alternatives you've considered

No response

Additional context

No response

jayzhan211 commented 3 weeks ago

I think it is a good chance to add documentation that helps yourself and others, or refactor the code to improve the codebase.

lewiszlw commented 3 weeks ago

Yeah, I will try. I'm not familiar with the whole code yet, I'll keep reading.

2010YOUY01 commented 3 weeks ago

This would be a valuable improvement. Now the execution behavior is determined by AggregateMode + other more subtle inner states like spill_state.xxx, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc) However if it's a big refactor, it's better to wait until https://github.com/apache/datafusion/pull/11943 gets merged to avoid conflicts

Rachelint commented 2 weeks ago

This would be a valuable improvement. Now the execution behavior is determined by AggregateMode + other more subtle inner states like spill_state.xxx, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc) However if it's a big refactor, it's better to wait until #11943 gets merged to avoid conflicts

I think at least we should separate partial and the teminals(Final, Single...). I am making a poc trying to improve the performance of partial, and I found it is too hard to make it, because all modes are mixed up... I am worried that it will become unmaintainable in future...

eejbyfeldt commented 2 weeks ago

I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this https://github.com/apache/datafusion/pull/12517 I think it helps simplify the logic as it removes the is_stream_merging flag and lets that be handled as just another aggregation with final mode.

Rachelint commented 2 weeks ago

I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this #12517 I think it helps simplify the logic as it removes the is_stream_merging flag and lets that be handled as just another aggregation with final mode.

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

eejbyfeldt commented 2 weeks ago

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of

I think at least we should separate partial and the teminals(Final, Single...).

Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.

Rachelint commented 2 weeks ago

I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.

Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of

I think at least we should separate partial and the teminals(Final, Single...).

Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.

Maybe we can refactor the codes thoroughly after https://github.com/apache/datafusion/pull/11943 (which can obviously improve aggr performance) is merged.

I think the partial and other terminals are more reasonable and maintainable to be split into two different structs, and placed into different files. Because obviously that, some paths only belong to partial aggr:

And some paths only belong to terminals: