[Epic] High cardinality aggregation performance wishlist

alamb commented 3 months ago

Is your feature request related to a problem or challenge?

DataFusion uses a two phase approach to aggregation (see Accumulator::state) for details:

                              ▲
                              │                   evaluate() is called to
                              │                   produce the final aggregate
                              │                   value per group
                              │
                 ┌─────────────────────────┐
                 │GroupBy                  │
                 │(AggregateMode::Final)   │      state() is called for each
                 │                         │      group and the resulting
                 └─────────────────────────┘      RecordBatches passed to the
                              ▲
                              │
             ┌────────────────┴───────────────┐
             │                                │
             │                                │
┌─────────────────────────┐      ┌─────────────────────────┐
│        GroubyBy         │      │        GroubyBy         │
│(AggregateMode::Partial) │      │(AggregateMode::Partial) │
└─────────────────────────┘      └────────────▲────────────┘
             ▲                                │
             │                                │    update_batch() is called for
             │                                │    each input RecordBatch
        .─────────.                      .─────────.
     ,─'           '─.                ,─'           '─.
    ;      Input      :              ;      Input      :
    :   Partition 0   ;              :   Partition 1   ;
     ╲               ╱                ╲               ╱
      '─.         ,─'                  '─.         ,─'
         `───────'                        `───────'

For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨‍🍳

However for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on https://github.com/apache/datafusion/issues/7957 for why the intermediate cardinality increases

This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months #11631 )

Together with the StringView work in https://github.com/apache/datafusion/issues/10918 that @XiangpengHao @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries

Also I hear that @avantgardnerio may be interested in helping here