Open alamb opened 3 weeks ago
This may be fixed by fixing https://github.com/apache/datafusion/issues/11628
take
I will take a look on this issue too, how is your progress on this @devanbenz ?
I will take a look on this issue too, how is your progress on this @devanbenz ?
Looking at the following code path:
I noticed when writing up a small example using this code path locally.
When taking a look at the append_option
source code it implements a copy.
I've identified where the performance fault is--but I'm unsure what next steps to take to try and alleviate it. I was going to maybe modify the coalescer to try and "change the RecordBatch to stringview if the datatype is Utf8View
" as the input occurs around here (if that makes sense):
I don't have a meaningful plan yet just have been doing some exploratory work as of right now while benchmarking locally.
https://github.com/apache/datafusion/issues/11628 might have some ideas. I think @XiangpengHao has also been thinking of some way to do this too
Basically one thing you might be able to try is to switch from using take
to directly building the output (with a StringViewBuilder or something equvialent)
But as you are hinting at this is a non trivial thing to do
I finally filed a ticket upstream in arrow trying to explain what I have been thinking about:
Other than https://github.com/apache/arrow-rs/issues/6692.
If we create filter version of GroupColumn
that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192). We can then aggregate those small batches in filter exec and produce a single large output for next step. Does this sound a possible improvement?
I think the downside is again we need to implement builder per type like GroupColumn
and the improvement is unclear
If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)
Would this be a specialization of a filter of the aggregates like for SELECT SUM(x) FROM ... HAVING SUM(x) > 5
?
If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)
Would this be a specialization of a filter of the aggregates like for
SELECT SUM(x) FROM ... HAVING SUM(x) > 5
?
Probably, but I think it has potential to reduce copied too https://github.com/apache/datafusion/issues/11628. It aggregates at the first place similar to coalescer's role and no gc required since we aggregate the value without necessary buffer copied.
If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.
If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.
Feel free to work on it since this kind of task is usually not that trivial and everyone may come out a different solution (less chance of duplicated work)
Describe the bug
While enabling
StringView
reading from Parquet in https://github.com/apache/datafusion/pull/13101 @Dandandan noticed a slight regression for TPCH 18 https://github.com/apache/datafusion/pull/13101#issuecomment-2437865910here is the query
To Reproduce
To reproduce
Make data
Run query:
When StringView is enabled it seems like it is slightly slower
Expected behavior
StringView should always be faster
Additional context
I took a brief look at the flamegraphs -- it seems like one difference could be
BatchCoalescer::push_batch
There is a special case for StringView here: https://github.com/apache/datafusion/blob/6034be42808b43e3f48f6e58ec38cc35fa253abb/datafusion/physical-plan/src/coalesce/mod.rs#L117-L116
Here are the explain plans for the query before and after the change
Here are the flamegraphs for the query before/after the change