Backend
CH (ClickHouse)
Bug description
We hit a query with very bad join performance in our production environment. The query looks like the one below:
select * from t1 left join t2 on
t1.uid = t2.uid and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)
There are two main problems.
First, the right table is very large, over 5,000,000,000 rows. Using it to build the join hash table is very resource intensive.
Second, when only the join condition t1.uid = t2.uid is applied, it produces a very large matched result, more than 5,000,000,000 * 100 rows. But after the filter condition (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3) is applied to that result, fewer than 10,000,000 rows remain.
Two improvements could help here. One is to reorder the hash join tables so that the smaller table is used to build the hash table (see the first sketch below).
The other is to convert a mixed join condition into multiple join-on clauses when the table used to build the hash tables is small enough that it is safe to hold all of them in memory (see the second sketch below).
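To make the first point concrete, here is a rough sketch of what the reordering amounts to at the SQL level, assuming the engine always builds the hash table from the right-hand side of the join (as the ClickHouse hash join does). Ideally the planner would apply this swap automatically based on size estimates instead of requiring the query to be rewritten by hand:

-- Sketch only: the same left-outer semantics expressed as a right join, so the
-- smaller table t1 sits on the right-hand (build) side and the
-- 5,000,000,000-row table t2 is only streamed on the probe side.
select *
from t2 right join t1
on t1.uid = t2.uid and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)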
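And a rough hand-written sketch of the second idea. It only covers the matching rows; re-adding the t1 rows with no match at all (the left-outer part) is omitted, so treat it as an illustration of the rewrite rather than an equivalent query:

-- Each branch is a plain equi-join, so each one can be executed with its own
-- hash table keyed on (uid, idN) instead of joining on uid alone and filtering.
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id1 = t2.id1
union distinct
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id2 = t2.id2
union distinct
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id3 = t2.id3
-- union distinct drops the double counting when a (t1, t2) pair matches more
-- than one branch, but it also collapses rows that are genuinely duplicated in
-- the base tables, which is why this is only an approximation.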
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response