apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906

Open alamb opened 11 months ago

alamb commented 11 months ago

Is your feature request related to a problem or challenge?

https://github.com/apache/arrow-datafusion/pull/6904 introduces some fancy new hashing and ways to implement aggregates

min/max for strings (StringArray / LargeStringArray, etc) now uses the slower Accumulator implementation which could be made much faster

Describe the solution you'd like

I would like to implement a fast GroupsAccumulator for Min/Max

Describe alternatives you've considered

here is one potential way to implement it:

We could store the current minimum for all groups in the same Rows πŸ€” and track an index into that Rows for the current minimum for each group.

This would require an extra copy of the input values, but it could probably be vectorized pretty well, as shown in the following diagram.

Sorry what I meant was something like the following where the accumulator only stored the current minimum values.

This approach would potentially end up with min_storage being full of "garbage" if many batches had new minumums, but I think we could heuristically "compact" min_storage (if it had 2*num_groups, for example) if it got too large

                                    β”Œ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐  

                                    β”‚           Accumulator             β”‚  
                                                state                      
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  
β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚       β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚         β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚          β”‚ β”Œβ”€β”€β”€β”€β”€β” β”‚     
β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  │─┼────┐  β”‚ β”‚ β”‚  D  β”‚ β”‚    β”Œβ”€β”€β”€β”€β”€β”Όβ”€β”‚  1  β”‚ β”‚  β”‚  
β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚    β”‚    β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚    β”‚     β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚     
β”‚ β”‚  B  β”‚ β”‚       β”‚ β”‚  B  β”‚ β”‚    └──┼─┼▢│  A  β”‚β—€β”Όβ”€β”€β”€β”€β”˜     β”‚ β”‚  0  β”‚ β”‚  β”‚  
β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚          β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚     
β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚         β”‚          β”‚         β”‚     
β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚  A  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚       β”‚ β”œβ”€β”€β”€β”€β”€β”€ β”‚         β”‚         β”‚          β”‚         β”‚     
β”‚ β”‚  C  β”‚ β”‚       β”‚ β”‚  C  β”‚ β”‚       β”‚ β”‚         β”‚          β”‚         β”‚  β”‚  
β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚       β”‚ β””β”€β”€β”€β”€β”€β”˜ β”‚         β”‚         β”‚          β”‚         β”‚     
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  

  input              input          β”‚ min_storage:         min_values   β”‚  
  values             values           Rows                                 
  (Array)            (Rows)         β”” ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ β”˜  
          step 1:           step 2: for                                    
          convert           any value                   step 3: min value  
          arguments to      that is a new               (per group) is     
          Row format        group                       tracked as an      
                            minimum, copy               index into         
                            it to a                     min_storage `Rows` 
                            second `Rows`                                  

See https://github.com/apache/arrow-datafusion/pull/6800#issuecomment-1622290981 for more details

Additional context

No response

alamb commented 5 months ago

One observation here is that min and max on strings is not that common of an operation from what it seems -- grouping on strings is more common.

Maybe there is some binary usecase where it is important (e.g. embeddings πŸ€” )