apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
762 stars 151 forks source link

chore: Investigate impact of small batches on performance #495

Closed andygrove closed 1 month ago

andygrove commented 4 months ago

What is the problem the feature request solves?

I added some debug logging to CometNativeIterator to show the size of batches being processed when running TPC-H q14 and I see lots of small batches being processed.

Creating batch with 97 rows
Creating batch with 86 rows
Creating batch with 87 rows
Creating batch with 72 rows
Creating batch with 80 rows
...

The query processes 73,456 batches with fewer than 1000 rows and 2,448 batches with at least 1000 rows.

I wonder if there would be a performance benefit in coalescing these small batches into larger batches (if that is even possible -- I do not have full information on the context yet).

My theory is that we have some overhead per batch and that we could reduce that overhead if we had larger batches. This issue is for analyzing this and writing up some findings.

Describe the potential solution

No response

Additional context

No response

andygrove commented 4 months ago

More detailed debug info

[CometProject [l_partkey#17L, l_extendedprice#21, l_discount#22], [l_partkey#17L, l_extendedprice#21, l_discount#22]] Creating batch with 12 rows

[CometHashAggregate [l_extendedprice#21, l_discount#22, p_type#78], Partial, [partial_sum(CASE WHEN StartsWith(p_type#78, PROMO) THEN (l_extendedprice#21 * (1 - l_discount#22)) ELSE 0.0000 END), partial_sum((l_extendedprice#21 * (1 - l_discount#22)))]] Creating batch with 1 rows

[CometFilter [p_partkey#74L, p_type#78], isnotnull(p_partkey#74L)] Creating batch with 4816 rows
[CometFilter [p_partkey#74L, p_type#78], isnotnull(p_partkey#74L)] Creating batch with 8192 rows
andygrove commented 3 months ago

When running TPC-H q16 with DataFusion, there is a significant difference in performance between runs with coalesce batches enabled vs disabled.

With datafusion.execution.coalesce_batches=true:

Query 16 executed in: 1.907695417s and returned 27840 rows
Query 16 executed in: 1.900346945s and returned 27840 rows
Query 16 executed in: 1.913095568s and returned 27840 rows

With datafusion.execution.coalesce_batches=false:

Query 16 executed in: 3.06827928s and returned 27840 rows
Query 16 executed in: 3.052310983s and returned 27840 rows
Query 16 executed in: 3.222817185s and returned 27840 rows
andygrove commented 1 month ago

I am closing this for now, because I now understand that the shuffle write exec is essentially coalescing batches as well, and I have not yet been able to prove any benefits in adding explicit coalesce batch calls