apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Tweak optimal default batch size #6916

Open Dandandan opened 11 months ago

Dandandan commented 11 months ago

Is your feature request related to a problem or challenge?

A small test to double the batch size (from 8192 to 16384) shows some performance improvements (~10%) on some queries:

Comparing main and batch_size
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ batch_size ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 190.81ms │   186.92ms │     no change │
│ QQuery 2     │  59.14ms │    55.43ms │ +1.07x faster │
│ QQuery 3     │  47.03ms │    44.95ms │     no change │
│ QQuery 4     │  37.16ms │    33.60ms │ +1.11x faster │
│ QQuery 5     │  95.66ms │    86.47ms │ +1.11x faster │
│ QQuery 6     │  10.47ms │     8.93ms │ +1.17x faster │
│ QQuery 7     │ 196.26ms │   192.92ms │     no change │
│ QQuery 8     │  69.94ms │    63.64ms │ +1.10x faster │
│ QQuery 9     │ 136.17ms │   129.56ms │     no change │
│ QQuery 10    │  99.69ms │    97.21ms │     no change │
│ QQuery 11    │  45.22ms │    45.94ms │     no change │
│ QQuery 12    │  67.66ms │    65.41ms │     no change │
│ QQuery 13    │ 182.43ms │   156.67ms │ +1.16x faster │
│ QQuery 14    │  12.36ms │    11.16ms │ +1.11x faster │
│ QQuery 15    │  23.18ms │    20.79ms │ +1.11x faster │
│ QQuery 16    │  48.14ms │    47.22ms │     no change │
│ QQuery 17    │ 681.34ms │   682.63ms │     no change │
│ QQuery 18    │ 516.47ms │   490.86ms │     no change │
│ QQuery 19    │  58.40ms │    56.68ms │     no change │
│ QQuery 20    │ 193.13ms │   202.12ms │     no change │
│ QQuery 21    │ 261.12ms │   242.51ms │ +1.08x faster │
│ QQuery 22    │  27.81ms │    28.08ms │     no change │
└──────────────┴──────────┴────────────┴───────────────┘

This is nice for such a small change and aligns with https://github.com/apache/arrow-datafusion/issues/6287

Describe the solution you'd like

Configure a (more) optimal default

Describe alternatives you've considered

No response

Additional context

No response

2010YOUY01 commented 11 months ago

That's a great point The threshold of CoalesceBatches might be a related place for tuning: https://github.com/apache/arrow-datafusion/blob/52cf58b46133d448e067455baab0faf8a50e565a/datafusion/core/src/physical_plan/coalesce_batches.rs#L230 The current implementation I think will trigger coalesce when the input batch is < default batch size For example, if two consecutive inputs of CoalesceBatchesExec have batch size 8000 (default 8192), then they will be concatenated, introducing unnecessary memcpy This will happen if the query has some high selectivity predicates (e.g. TPCH Q1), I experimented setting the coalescing threshold to target_batch_size * 0.8, Q1 can run ~20% faster