apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
740 stars 144 forks source link

Improve performance of standard deviation aggregate #824

Open andygrove opened 1 month ago

andygrove commented 1 month ago

What is the problem the feature request solves?

I am benchmarking with this query (based on TPC-DS q39):

select w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy
     ,stddev_samp(inv_quantity_on_hand) stdev
from inventory
   ,item
   ,warehouse
   ,date_dim
where inv_item_sk = i_item_sk
  and inv_warehouse_sk = w_warehouse_sk
  and inv_date_sk = d_date_sk
  and d_year =2001
group by w_warehouse_name,w_warehouse_sk,i_item_sk,d_moy;

Results:

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
stddev                                             3295           3370         105         40.4          24.7       1.0X
stddev: Comet (Scan)                               3343           3461         167         39.9          25.1       1.0X
stddev: Comet (Scan, Exec)                         7621           7657          50         17.5          57.2       0.4X

If I change the query to use avg instead of stddev then I get much better results (ignore the benchmark name):

AMD Ryzen 9 7950X3D 16-Core Processor
TPCDS Micro Benchmarks:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
stddev                                             3187           3276         125         41.8          23.9       1.0X
stddev: Comet (Scan)                               3128           3163          48         42.6          23.5       1.0X
stddev: Comet (Scan, Exec)                         2209           2241          44         60.3          16.6       1.4X

Describe the potential solution

No response

Additional context

No response

eejbyfeldt commented 3 weeks ago

I suspect that implementing the GroupsAccumulator api could resolve this. Took a stab at implementing it here: https://github.com/apache/datafusion/pull/12095 (issue: https://github.com/apache/datafusion/issues/12094) in datafusion. This can the easily be replicated in our version of the aggregates.

andygrove commented 3 weeks ago

Thank you @eejbyfeldt