alamb opened this issue 1 year ago
Is the majority of the concat_batches overhead string concatenation? Just wondering if this is something that StringView might help with?
From an internal conversation, @tustvold points out that interleave is a fair bit slower than the filter kernel, especially for primitives. The current code uses filter then concat_batches; the proposal would use interleave. Thus we may have to special case how the columns are handled depending on type (e.g. which kernels are called for which operations).
I tried (https://github.com/Dandandan/arrow-datafusion/tree/buffer_batches) to convert the code to use interleave instead of concat (CoalesceBatches) for hash repartitioning, but this leads to no improvement whatsoever (even a small slowdown):
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ buffer_batches ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  93.31ms │        90.88ms │     no change │
│ QQuery 2     │  26.18ms │        28.39ms │  1.08x slower │
│ QQuery 3     │  53.96ms │        57.86ms │  1.07x slower │
│ QQuery 4     │  52.33ms │        51.77ms │     no change │
│ QQuery 5     │ 114.93ms │       125.69ms │  1.09x slower │
│ QQuery 6     │   9.38ms │         9.20ms │     no change │
│ QQuery 7     │ 211.60ms │       227.30ms │  1.07x slower │
│ QQuery 8     │  61.54ms │        59.56ms │     no change │
│ QQuery 9     │  60.66ms │        63.80ms │  1.05x slower │
│ QQuery 10    │ 114.31ms │       114.95ms │     no change │
│ QQuery 11    │  20.12ms │        20.54ms │     no change │
│ QQuery 12    │  59.98ms │        59.26ms │     no change │
│ QQuery 13    │  53.37ms │        56.42ms │  1.06x slower │
│ QQuery 14    │  18.47ms │        18.63ms │     no change │
│ QQuery 15    │  58.99ms │        59.31ms │     no change │
│ QQuery 16    │  22.79ms │        22.27ms │     no change │
│ QQuery 17    │  54.41ms │        53.59ms │     no change │
│ QQuery 18    │ 158.61ms │       162.68ms │     no change │
│ QQuery 19    │  36.98ms │        34.78ms │ +1.06x faster │
│ QQuery 20    │  62.49ms │        62.04ms │     no change │
│ QQuery 21    │ 251.42ms │       256.99ms │     no change │
│ QQuery 22    │  14.35ms │        14.14ms │     no change │
└──────────────┴──────────┴────────────────┴───────────────┘
Another finding to add to the list is that changing RepartitionExec to do a filter (with hash repartitioning) is a bit slower than take or interleave on TPC-H benchmarks.
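For a bit of context on what is being compared, here is a rough sketch (not the actual RepartitionExec code; take_partition, filter_partition, and row_partitions are made-up names for this illustration) of selecting one hash partition's rows with the take kernel versus building a boolean mask and calling filter:

```rust
use arrow::array::{ArrayRef, BooleanArray, UInt32Array};
use arrow::compute::{filter_record_batch, take};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Select the rows of `batch` assigned to `partition` using the take kernel
/// (an index-based copy, roughly the approach used today).
fn take_partition(
    batch: &RecordBatch,
    row_partitions: &[usize], // partition id for each row, e.g. hash % num_partitions
    partition: usize,
) -> Result<RecordBatch, ArrowError> {
    // Indices of the rows belonging to this partition
    let indices: UInt32Array = row_partitions
        .iter()
        .enumerate()
        .filter(|(_, p)| **p == partition)
        .map(|(i, _)| i as u32)
        .collect();
    // Copy those rows out of every column
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), &indices, None))
        .collect::<Result<_, _>>()?;
    RecordBatch::try_new(batch.schema(), columns)
}

/// The alternative measured above: build a boolean mask per partition and
/// use the filter kernel instead.
fn filter_partition(
    batch: &RecordBatch,
    row_partitions: &[usize],
    partition: usize,
) -> Result<RecordBatch, ArrowError> {
    let mask: BooleanArray = row_partitions
        .iter()
        .map(|p| Some(*p == partition))
        .collect();
    filter_record_batch(batch, &mask)
}
```

Both approaches copy the selected rows once; the finding above is that the mask-based filter path comes out a bit slower than take or interleave for this workload.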
Thanks for the info @Dandandan -- those are some interesting results
I made some pictures for what I had in mind
This is what happens today (FilterExec + RepartitionExec):
┌────────────────────┐ Filter
│ │ ┌────────────────────┐ Coalesce
│ │ ─ ─ ─ ─ ─ ─ ▶ │ RecordBatch │ Batches
│ RecordBatch │ │ num_rows = 234 │─ ─ ─ ─ ─ ┐
│ num_rows = 8000 │ └────────────────────┘
│ │ │
│ │ ┌────────────────────┐
└────────────────────┘ │ │ │
┌────────────────────┐ ┌────────────────────┐ │ │
│ │ Filter │ │ │ │ │
│ │ │ RecordBatch │ ─ ─ ─ ─ ─ ▶│ │
│ RecordBatch │ ─ ─ ─ ─ ─ ─ ▶ │ num_rows = 500 │─ ─ ─ ─ ─ ┐ │ │
│ num_rows = 8000 │ │ │ │ RecordBatch │
│ │ │ │ └ ─ ─ ─ ─ ─▶│ num_rows = 8000 │
│ │ └────────────────────┘ │ │
└────────────────────┘ │ │
... ─ ─ ─ ─ ─ ▶│ │
... ... │ │ │
│ │
┌────────────────────┐ │ └────────────────────┘
│ │ ┌────────────────────┐
│ │ Filter │ │ │
│ RecordBatch │ │ RecordBatch │
│ num_rows = 8000 │ ─ ─ ─ ─ ─ ─ ▶ │ num_rows = 333 │─ ─ ─ ─ ─ ┘
│ │ │ │
│ │ └────────────────────┘
└────────────────────┘
FilterExec creates output batches of the matching rows (calls take() to make a copy); RepartitionExec then copies the data *again* to form the final large RecordBatches.
Instead do it like this (only buffer the filter mask results and create the final output in one go):
┌────────────────────┐ Filter
│ │ ┌────────────────────┐ Filter output
│ │ ─ ─ ─ ─ ─ ─ ▶ │ mask │
│ RecordBatch │ │ (BooleanArray) │─ ─ ─ ─ ─ ┐
│ num_rows = 8000 │ └────────────────────┘
│ │ │
│ │ ┌────────────────────┐
└────────────────────┘ │ │ │
┌────────────────────┐ ┌────────────────────┐ │ │
│ │ Filter │ │ │ │ │
│ │ │ mask │ ─ ─ ─ ─ ─ ▶│ │
│ RecordBatch │ ─ ─ ─ ─ ─ ─ ▶ │ (BooleanArray) │─ ─ ─ ─ ─ ┐ │ │
│ num_rows = 8000 │ │ │ │ RecordBatch │
│ │ │ │ └ ─ ─ ─ ─ ─▶│ num_rows = 8000 │
│ │ └────────────────────┘ │ │
└────────────────────┘ │ │
... ─ ─ ─ ─ ─ ▶│ │
... ... │ │ │
│ │
┌────────────────────┐ │ └────────────────────┘
│ │ ┌────────────────────┐
│ │ Filter │ │ │
│ RecordBatch │ │ mask │
│ num_rows = 8000 │ ─ ─ ─ ─ ─ ─ ▶ │ (BooleanArray) │─ ─ ─ ─ ─ ┘
│ │ │ │
│ │ └────────────────────┘
└────────────────────┘
FilterExec et al. internally buffer RecordBatches and filter results until enough rows are ready; the Exec then copies rows from multiple input batches (based on the masks) into a single new output RecordBatch, avoiding the second data copy.
Maybe we could add a filter_multi kernel alongside https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html
The signature would be something like:
/// Filters pairs of arrays / predicates into a single output array
pub fn filter_multi<'a>(
    input: impl IntoIterator<Item = (&'a dyn Array, &'a BooleanArray)>,
) -> Result<Arc<dyn Array>, ArrowError>
And then the internal machinery could be the same
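For illustration, here is a naive sketch of such a kernel built from the existing filter and concat kernels. This version still materializes per-pair intermediates, so it is only a stand-in for the signature; a real filter_multi would copy the selected rows directly into a single output buffer, which is the whole point of the proposal.

```rust
use std::sync::Arc;
use arrow::array::{Array, BooleanArray};
use arrow::compute::{concat, filter};
use arrow::error::ArrowError;

/// Naive sketch only: filter each (array, predicate) pair, then concatenate.
/// A real kernel would write the selected values straight into one output
/// array and avoid these intermediate allocations.
pub fn filter_multi<'a>(
    input: impl IntoIterator<Item = (&'a dyn Array, &'a BooleanArray)>,
) -> Result<Arc<dyn Array>, ArrowError> {
    // Apply the existing filter kernel to each pair
    let filtered: Vec<Arc<dyn Array>> = input
        .into_iter()
        .map(|(array, predicate)| filter(array, predicate))
        .collect::<Result<_, _>>()?;
    // Concatenate the per-pair results into a single output array
    let array_refs: Vec<&dyn Array> = filtered.iter().map(|a| a.as_ref()).collect();
    concat(&array_refs)
}
```

A dedicated kernel could share the internal machinery of the existing filter implementation while producing only a single output allocation.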
TPCH-Q1's expensive CoalesceBatches might be easier to solve 🤔
Unlike TPCH-Q8 (which appears to have a more selective predicate), Q1's filter only throws away a small number of rows:
input_batch(4096 rows) -> [FILTER] -> output_batch(~4000 rows)
And the following CoalesceBatches condition will be triggered every time to copy the large output batches:
https://github.com/apache/datafusion/blob/49d9d45f36989cd448ed6513af65948b6b0100ec/datafusion/physical-plan/src/coalesce_batches.rs#L228
But even without coalescing, the output batch would still benefit from vectorization, so maybe this coalescing threshold can be tuned better, e.g. to something like if self.buffered_rows >= 0.6 * self.target_batch_size {
I remember trying this before; the overall improvement on Q1 was only about 2%, but I think it's possible to set a better threshold for triggering coalescing.
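For concreteness, a hedged sketch of that kind of tunable trigger (should_coalesce and the 0.6 value are illustrative only, taken from the example above; this is not the current coalesce_batches.rs code nor a benchmarked recommendation):

```rust
/// Illustrative only: emit the buffered rows once they reach a configurable
/// fraction of target_batch_size, instead of waiting for the full amount.
fn should_coalesce(buffered_rows: usize, target_batch_size: usize, fraction: f64) -> bool {
    buffered_rows as f64 >= fraction * target_batch_size as f64
}

fn main() {
    // With target_batch_size = 8192 and ~4000-row filter outputs (the Q1 shape
    // above), a 0.6 fraction triggers the copy after two batches instead of three.
    assert!(!should_coalesce(4000, 8192, 0.6));
    assert!(should_coalesce(8000, 8192, 0.6));
}
```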
Thanks @2010YOUY01 -- that sounds neat. Also thanks for the tip that Q1 is a good candidate query this could improve.
In general I agree there are likely easier ways to improve specific queries. I am still bullish on the more general idea of trying to avoid the copy (I think it would be good for several percentage points across many queries), though it will be a fairly substantial task.
I am going to spend some more time prototyping and see how it goes
I am pretty happy with how this is headed.
I have a PR to start refactoring the code here https://github.com/apache/datafusion/pull/11610
I am also feeling good about my WIP prototype
Is your feature request related to a problem or challenge?
While looking at TPCH query performance for #6782 I noticed several times that CoalesceBatchesExec takes non-trivial amounts of time (like 5% of the overall query time). Here is a specific example.
Create Data:
Run query with datafusion-cli:
Here is the full EXPLAIN ANALYZE output: explan-analyze-q8.txt. A small subset shows there is a single CoalesceBatchesExec that takes 3 seconds (elapsed_compute=3.066514072s). I profiled the query and confirmed that CoalesceBatchesExec takes 5% of the overall time, as shown in this screenshot. In diagrams this looks like:
Describe the solution you'd like
I think we can avoid this overhead by combining the behavior of CoalesceBatchesExec into the operators that make small batches (FilterExec, JoinExec, and RepartitionExec).
The idea would be to take the core coalesce logic from CoalesceBatchesExec that calls concat_batches, and instead of creating new small record batches in FilterExec, HashJoinExec, and RepartitionExec, buffer the inputs until there are at least target_batch_size rows available, and then call interleave instead (see the sketch after these links).
Here is the code in CoalesceBatchesExec that could be adapted: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/coalesce_batches.rs#L215-L283
Here is where FilterExec makes the potentially small record batches: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/filter.rs#L294-L306
The same thing would be done in RepartitionExec: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/repartition/mod.rs#L193-L218
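Here is a rough sketch of that buffering idea (BatchBuffer is a hypothetical type invented for this illustration, not the actual DataFusion code; it assumes all buffered batches share the output schema):

```rust
use arrow::array::{Array, ArrayRef};
use arrow::compute::interleave;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical buffer an operator like FilterExec could hold internally:
/// collect small batches and only materialize a combined output once at
/// least `target_batch_size` rows have accumulated.
struct BatchBuffer {
    schema: SchemaRef,
    target_batch_size: usize,
    buffered: Vec<RecordBatch>,
    buffered_rows: usize,
}

impl BatchBuffer {
    /// Buffer a (possibly small) batch; return one combined batch once
    /// enough rows are available.
    fn push(&mut self, batch: RecordBatch) -> Result<Option<RecordBatch>, ArrowError> {
        self.buffered_rows += batch.num_rows();
        self.buffered.push(batch);
        if self.buffered_rows < self.target_batch_size {
            return Ok(None);
        }
        // (batch_index, row_index) pairs covering every buffered row in order
        let indices: Vec<(usize, usize)> = self
            .buffered
            .iter()
            .enumerate()
            .flat_map(|(b, batch)| (0..batch.num_rows()).map(move |r| (b, r)))
            .collect();
        // Interleave column-by-column across all buffered batches
        let columns: Vec<ArrayRef> = (0..self.schema.fields().len())
            .map(|c| {
                let arrays: Vec<&dyn Array> = self
                    .buffered
                    .iter()
                    .map(|batch| batch.column(c).as_ref())
                    .collect();
                interleave(&arrays, &indices)
            })
            .collect::<Result<_, _>>()?;
        self.buffered.clear();
        self.buffered_rows = 0;
        RecordBatch::try_new(self.schema.clone(), columns).map(Some)
    }
}
```

In FilterExec the small filter output (or, per the diagrams above, the batch plus its mask) would be pushed into such a buffer instead of being emitted immediately, so the separate CoalesceBatchesExec copy disappears.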
Describe alternatives you've considered
No response
Additional context
No response