Is your feature request related to a problem or challenge?
DataFusion uses a two phase approach to aggregation (see Accumulator::state) for details:
▲
│ evaluate() is called to
│ produce the final aggregate
│ value per group
│
┌─────────────────────────┐
│GroupBy │
│(AggregateMode::Final) │ state() is called for each
│ │ group and the resulting
└─────────────────────────┘ RecordBatches passed to the
▲
│
┌────────────────┴───────────────┐
│ │
│ │
┌─────────────────────────┐ ┌─────────────────────────┐
│ GroubyBy │ │ GroubyBy │
│(AggregateMode::Partial) │ │(AggregateMode::Partial) │
└─────────────────────────┘ └────────────▲────────────┘
▲ │
│ │ update_batch() is called for
│ │ each input RecordBatch
.─────────. .─────────.
,─' '─. ,─' '─.
; Input : ; Input :
: Partition 0 ; : Partition 1 ;
╲ ╱ ╲ ╱
'─. ,─' '─. ,─'
`───────' `───────'
For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨🍳
However for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on https://github.com/apache/datafusion/issues/7957 for why the intermediate cardinality increases
This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months #11631 )
Together with the StringView work in https://github.com/apache/datafusion/issues/10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries
Also I hear that @avantgardnerio may be interested in helping here
Is your feature request related to a problem or challenge?
DataFusion uses a two phase approach to aggregation (see
Accumulator::state
) for details:For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨🍳
However for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on https://github.com/apache/datafusion/issues/7957 for why the intermediate cardinality increases
This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months #11631 )
Together with the StringView work in https://github.com/apache/datafusion/issues/10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries
Also I hear that @avantgardnerio may be interested in helping here
Describe the solution you'd like
Here is my wishlist:
Describe alternatives you've considered
Do nothing and let DuckDB pass us by ;)
Additional context
Other potential things to do: