Backend
CH (ClickHouse)
Bug description
We hit a query with very bad join performance in our production environment. The query looks like the one below:
select * from t1 left join t2 on
t1.uid = t2.uid and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)
There are two main problems.
First, the right table is very large, over 5,000,000,000 rows. Using it to build the join hash table is very resource intensive.
Second, when only the join condition t1.uid = t2.uid is applied, it produces a very large matched result, more than 5,000,000,000 * 100 rows. But after the filter condition (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3) is applied to that result, fewer than 10,000,000 rows remain.
Two improvements could help here. One is to reorder the hash join tables so that the smaller table is used to build the hash table (see the first sketch below).
The other is to convert a mixed join condition into multiple join-on clauses when the table used to build the hash tables is small enough that it is safe to hold all of them in memory (see the second sketch below).
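To make the first point concrete, here is a rough sketch of what the reordering amounts to at the SQL level, assuming the engine always builds the hash table from the right-hand side of the join (as the ClickHouse hash join does). Ideally the planner would apply this swap automatically based on size estimates instead of requiring the query to be rewritten by hand:

-- Sketch only: the same left-outer semantics expressed as a right join, so the
-- smaller table t1 sits on the right-hand (build) side and the
-- 5,000,000,000-row table t2 is only streamed on the probe side.
select *
from t2 right join t1
on t1.uid = t2.uid and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)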
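And a rough hand-written sketch of the second idea. It only covers the matching rows; re-adding the t1 rows with no match at all (the left-outer part) is omitted, so treat it as an illustration of the rewrite rather than an equivalent query:

-- Each branch is a plain equi-join, so each one can be executed with its own
-- hash table keyed on (uid, idN) instead of joining on uid alone and filtering.
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id1 = t2.id1
union distinct
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id2 = t2.id2
union distinct
select t1.*, t2.* from t1 join t2 on t1.uid = t2.uid and t1.id3 = t2.id3
-- union distinct drops the double counting when a (t1, t2) pair matches more
-- than one branch, but it also collapses rows that are genuinely duplicated in
-- the base tables, which is why this is only an approximation.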
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response