Open Dandandan opened 11 months ago
That's a great point
The threshold of CoalesceBatches
might be a related place for tuning:
https://github.com/apache/arrow-datafusion/blob/52cf58b46133d448e067455baab0faf8a50e565a/datafusion/core/src/physical_plan/coalesce_batches.rs#L230
The current implementation I think will trigger coalesce when the input batch is < default batch size
For example, if two consecutive inputs of CoalesceBatchesExec
have batch size 8000 (default 8192), then they will be concatenated, introducing unnecessary memcpy
This will happen if the query has some high selectivity predicates (e.g. TPCH Q1), I experimented setting the coalescing threshold to target_batch_size
* 0.8, Q1 can run ~20% faster
Is your feature request related to a problem or challenge?
A small test to double the batch size (from 8192 to 16384) shows some performance improvements (~10%) on some queries:
This is nice for such a small change and aligns with https://github.com/apache/arrow-datafusion/issues/6287
Describe the solution you'd like
Configure a (more) optimal default
Describe alternatives you've considered
No response
Additional context
No response