Open lewiszlw opened 3 weeks ago
I think it is a good chance to add documentation that helps yourself and others, or refactor the code to improve the codebase.
Yeah, I will try. I'm not familiar with the whole code yet, I'll keep reading.
This would be a valuable improvement. Now the execution behavior is determined by AggregateMode
+ other more subtle inner states like spill_state.xxx
, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc)
However if it's a big refactor, it's better to wait until https://github.com/apache/datafusion/pull/11943 gets merged to avoid conflicts
This would be a valuable improvement. Now the execution behavior is determined by
AggregateMode
+ other more subtle inner states likespill_state.xxx
, and some functions will handle several less related paths at the same time, it's a good idea to separate them (or at least add more doc) However if it's a big refactor, it's better to wait until #11943 gets merged to avoid conflicts
I think at least we should separate partial and the teminals(Final, Single...). I am making a poc trying to improve the performance of partial, and I found it is too hard to make it, because all modes are mixed up... I am worried that it will become unmaintainable in future...
I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this https://github.com/apache/datafusion/pull/12517 I think it helps simplify the logic as it removes the is_stream_merging
flag and lets that be handled as just another aggregation with final mode.
I was also reading the aggregation code and came up with one possible simplification. What do you think about the idea of a change like this #12517 I think it helps simplify the logic as it removes the
is_stream_merging
flag and lets that be handled as just another aggregation with final mode.
I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.
I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.
Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of
I think at least we should separate partial and the teminals(Final, Single...).
Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.
I guess it may be a bit confused? Because an aggr operator wraps another aggr operator.
Yeah, I guess it a bit of a double edge sword that it just moves the complexity elsewhere. But I could think well together with your suggestion of
I think at least we should separate partial and the teminals(Final, Single...).
Because without doing something like this the partial would still need all the logic for the terminal. Or at least my understand is that the stream merging is basically just performing a terminal aggregation.
Maybe we can refactor the codes thoroughly after https://github.com/apache/datafusion/pull/11943 (which can obviously improve aggr performance) is merged.
I think the partial
and other terminal
s are more reasonable and maintainable to be split into two different structs, and placed into different files.
Because obviously that, some paths only belong to partial aggr
:
And some paths only belong to terminal
s:
Is your feature request related to a problem or challenge?
I'm reading aggregation code recently, found it's hard to read due to different
AggregateMode
, these modes are mixed up in the code.Describe the solution you'd like
Might split stream based on
AggregateMode
?Describe alternatives you've considered
No response
Additional context
No response