apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.33k stars 1.2k forks source link

Potential performance regression for TPCH q18 #13188

Open alamb opened 3 weeks ago

alamb commented 3 weeks ago

Describe the bug

While enabling StringView reading from Parquet in https://github.com/apache/datafusion/pull/13101 @Dandandan noticed a slight regression for TPCH 18 https://github.com/apache/datafusion/pull/13101#issuecomment-2437865910

here is the query

select
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    sum(l_quantity)
from
    customer,
    orders,
    lineitem
where
        o_orderkey in (
        select
            l_orderkey
        from
            lineitem
        group by
            l_orderkey having
                sum(l_quantity) > 300
    )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice
order by
    o_totalprice desc,
    o_orderdate;

To Reproduce

To reproduce

Make data

# make the data and get to the correct location
cd datafusion/benchmarks
./bench.sh data tpch
cd data/tpch_sf1

Run query:

datafusion-cli -f ../../queries/q18.sql  | grep Elapsed
Elapsed 0.088 seconds.

When StringView is enabled it seems like it is slightly slower

Expected behavior

StringView should always be faster

Additional context

I took a brief look at the flamegraphs -- it seems like one difference could be BatchCoalescer::push_batch

Screenshot 2024-10-30 at 2 13 38 PM

There is a special case for StringView here: https://github.com/apache/datafusion/blob/6034be42808b43e3f48f6e58ec38cc35fa253abb/datafusion/physical-plan/src/coalesce/mod.rs#L117-L116

Here are the explain plans for the query before and after the change

Here are the flamegraphs for the query before/after the change

alamb commented 3 weeks ago

This may be fixed by fixing https://github.com/apache/datafusion/issues/11628

devanbenz commented 2 weeks ago

take

jayzhan211 commented 2 weeks ago

I will take a look on this issue too, how is your progress on this @devanbenz ?

devanbenz commented 2 weeks ago

I will take a look on this issue too, how is your progress on this @devanbenz ?

Looking at the following code path:

https://github.com/apache/datafusion/blob/9005585fa6f4eb6a4d0cc515b6ad76794c33c626/datafusion/physical-plan/src/coalesce/mod.rs#L246-L263

I noticed when writing up a small example using this code path locally.

Screenshot 2024-11-03 at 12 30 19 PM

When taking a look at the append_option source code it implements a copy.

I've identified where the performance fault is--but I'm unsure what next steps to take to try and alleviate it. I was going to maybe modify the coalescer to try and "change the RecordBatch to stringview if the datatype is Utf8View" as the input occurs around here (if that makes sense):

https://github.com/apache/datafusion/blob/9005585fa6f4eb6a4d0cc515b6ad76794c33c626/datafusion/physical-plan/src/coalesce_batches.rs#L297-L307

I don't have a meaningful plan yet just have been doing some exploratory work as of right now while benchmarking locally.

alamb commented 2 weeks ago

https://github.com/apache/datafusion/issues/11628 might have some ideas. I think @XiangpengHao has also been thinking of some way to do this too

Basically one thing you might be able to try is to switch from using take to directly building the output (with a StringViewBuilder or something equvialent)

But as you are hinting at this is a non trivial thing to do

alamb commented 2 weeks ago

I finally filed a ticket upstream in arrow trying to explain what I have been thinking about:

jayzhan211 commented 2 weeks ago

Other than https://github.com/apache/arrow-rs/issues/6692. If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192). We can then aggregate those small batches in filter exec and produce a single large output for next step. Does this sound a possible improvement?

I think the downside is again we need to implement builder per type like GroupColumn and the improvement is unclear

alamb commented 2 weeks ago

If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)

Would this be a specialization of a filter of the aggregates like for SELECT SUM(x) FROM ... HAVING SUM(x) > 5 ?

jayzhan211 commented 2 weeks ago

If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)

Would this be a specialization of a filter of the aggregates like for SELECT SUM(x) FROM ... HAVING SUM(x) > 5 ?

Probably, but I think it has potential to reduce copied too https://github.com/apache/datafusion/issues/11628. It aggregates at the first place similar to coalescer's role and no gc required since we aggregate the value without necessary buffer copied.

devanbenz commented 2 days ago

If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.

jayzhan211 commented 2 days ago

If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.

Feel free to work on it since this kind of task is usually not that trivial and everyone may come out a different solution (less chance of duplicated work)