Open andygrove opened 2 months ago
Disabling sortMergeJoin via configs restores the original performance.
Thanks @andygrove I'm planning to profile it. Just for reference Q72 using INNER and LEFT join types
A secondary issue is that I do not see metrics for CometSort / CometSortMergeJoin.
Hmm, @andygrove Can you verify it again? In unit test, we have a test for CometSortMergeJoinExec metrics. It should have SQL metrics.
I ran again with latest from main (0033), and then with SMJ + join filter disabled manually (0034). Here are the event logs.
Here is a screenshot comparing the plans with SMJ+filter enabled on the left and disabled on the right.
Hmm, I will run locally to see why the metrics are not there but they are in unit test.
Im running slightly changed Q72 in DF
select i_item_desc
,w_warehouse_name
,d1.d_week_seq
,sum(case when p_promo_sk is null then 1 else 0 end) no_promo
,sum(case when p_promo_sk is not null then 1 else 0 end) promo
,count(*) total_cnt
from catalog_sales
join date_dim d1 on (cs_sold_date_sk = d1.d_date_sk)
join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk)
join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk)
join item on (i_item_sk = cs_item_sk)
join inventory on (cs_item_sk = inv_item_sk)
join warehouse on (w_warehouse_sk=inv_warehouse_sk)
join date_dim d2 on (inv_date_sk = d2.d_date_sk)
join date_dim d3 on (cs_ship_date_sk = d3.d_date_sk)
left outer join promotion on (cs_promo_sk=p_promo_sk)
left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number)
where d1.d_week_seq = d2.d_week_seq
and inv_quantity_on_hand < cs_quantity
and d3.d_date > d1.d_date + interval '5' day
and d1.d_year = 1999
group by i_item_desc,w_warehouse_name,d1.d_week_seq
order by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq
LIMIT 100
HashJoin - 9sec, SMJ - 20 sec
Having LEFT OUTER joins removed the results are still the same
HashJoin - 8.5 SMJ - 19 sec
I'll build a flamegraph for SMJ soon
Looks like a lot of time spent on arrow_select::take::take_impl
which is specific for filtered join. take
gets filtered matched indices from the joined data
A secondary issue is that I do not see metrics for CometSort / CometSortMergeJoin.
Hmm, @andygrove Can you verify it again? In unit test, we have a test for CometSortMergeJoinExec metrics. It should have SQL metrics.
@andygrove I just ran a simple sort merge join query locally on Spark 4.0 + Comet built from latest main
branch:
Looks like the metrics are shown for CometSort / CometSortMergeJoin.
Describe the bug
When running TPC-DS benchmarks against 100 GB data set I see a large regression in performance. For example, here are the timings for q72 before and after adding support for SMJ with join condition.
Adding support for SMJ with join condition means that more of the plan is likely running natively and the performance issue isn't necessarily directly related to SMJ.
before
after
A secondary issue is that I do not see metrics for CometSort / CometSortMergeJoin.
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response