Closed xiaoyong-z closed 1 year ago
@alamb hello, it seems that datafusion currently doesn't have StreamAggregation. If no one works on this, i want to implement it.
Hi @xiaoyong-z -- that is great news! -- I believe that @metesynnada @ozankabak mentioned they wanted to work on this feature. Let's use this ticket to collaborate on a design.
I believe https://github.com/apache/arrow-datafusion/issues/1570 is also related as streaming grouping is often used to merge the spilled groups. @milenkovicm and I had some discussion about this https://github.com/apache/arrow-datafusion/issues/1570#issuecomment-1247152916 but I never followed through on a writeup
This was on our roadmap and we would love to help out on this. @alamb, if you can share with us the papers/resources you mentioned on this we can digest them and share our thinking on the design. @xiaoyong-z, do you have a particular design in mind yet?
I will begin a google doc for us to collaborate on
Here is a google doc with some ideas https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing
I have it in "comment" mode for everyone on the internet, but please feel free to request edit access and I will grant it
Thank you for putting this together, had an initial look. I expect us to do a deeper dive, take a look at the mentioned papers, and give meaningful comments in the next several days.
Our team is also looking forward to this feature and the memory limited aggregation 👍
Thank you all. I'm still in the very beginning stage, and i plan to investigate some papers and how other system implement it in the following days. @alamb thanks for sharing the google doc, i will put my system design plan on it in the future.
I update some plans to implement the stream aggregation on https://docs.google.com/document/d/16rm5VR1nGkY6DedMCh1NUmThwf3RduAweaBH9b1h6AY/edit?usp=sharing
, PTAL. Detail design for fully stream aggregation will be given in the next following days.
PTAL. Detail design for fully stream aggregation will be given in the next following days.
Thank you @xiaoyong-z -- I read your addition and left some comments. Overall I think it is a great idea.
Here is one possibly approach to implementation (perhaps what you had in mind):
StreamAggregate
that handles pre-sorted data (where the data is already sorted according to the grouping keys/ partition keys). AggregateStream
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/aggregates/no_grouping.rs and replace its use in the optimizer with the new StreamAggregate operatorI think that would get us pretty far.
I am not sure about the idea of "sort the data first and then run the stream aggregator" -- as I mentioned in the document I think it is unlikely that approach will be better in terms of overall memory usage or performance.
When we want to support spilling group by (external group by) that is when sort might be beneficial.
@mustafasrepo, can you take a detailed look at @xiaoyong-z's design? Thanks.
Thanks @xiaoyong-z, For the design. I asked some questions to understand the design better, and left some comments. Overall I think, your road map is well thought and planned.
Hi @xiaoyong-z, I can receive some of the tasks from the document. Specifically I would like to start out with the case.
If you are not working already on this feature.
@mustafasrepo Sorry, currently i don't have time to push this work. If you have time on your side, you can work on any part of this feature.
If the input group by columns has been sorted before the aggregation, we can enable stream aggregation, it is more efficient than HashAggregation.