apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.21k stars 435 forks

[CH] A bad case for joining with mixed join conditions #6768

Closed lgbo-ustc closed 3 months ago

lgbo-ustc commented 3 months ago

Backend

CH (ClickHouse)

Bug description


We hit a query with very poor join performance in our production environment. The query looks like this:

select * from t1 left join t2 on 
t1.uid = t2.uid and (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3)

There are two main problems.

First, the right table is very large, over 5,000,000,000 rows. Using it to build the join hash table is very resource-intensive.

Second, applying only the join condition t1.uid = t2.uid produces a very large set of matches, more than 5,000,000,000 * 100 rows. But after the filter condition (t1.id1 = t2.id1 or t1.id2 = t2.id2 or t1.id3 = t2.id3) is applied to these matches, fewer than 10,000,000 rows are left.
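
To make the second problem concrete, here is a minimal Python sketch of a hash join that uses only uid as the hash key and applies the OR condition as a post-filter on every uid match. It is a toy model with made-up data, not the actual CH backend or Gluten implementation; the function name `left_join_mixed` and the `candidates` counter are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def left_join_mixed(t1, t2):
    """Toy left join: hash on uid, then filter each candidate pair with the OR condition."""
    # Build phase: hash the right table on the equi-join key only.
    build = defaultdict(list)
    for r in t2:
        build[r["uid"]].append(r)

    candidates = 0  # number of pairs that match on uid alone
    out = []
    # Probe phase: every uid match must be checked against the OR condition.
    for l in t1:
        matched = False
        for r in build.get(l["uid"], []):
            candidates += 1
            if l["id1"] == r["id1"] or l["id2"] == r["id2"] or l["id3"] == r["id3"]:
                out.append((l, r))
                matched = True
        if not matched:
            out.append((l, None))  # left join keeps unmatched left rows
    return out, candidates


# Made-up data: one uid shared by 10 left rows and 100 right rows.
t1 = [{"uid": 1, "id1": i, "id2": -1, "id3": -1} for i in range(10)]
t2 = [{"uid": 1, "id1": j, "id2": -2, "id3": -3} for j in range(100)]

rows, candidates = left_join_mixed(t1, t2)
print(candidates, len(rows))  # 1000 candidate pairs, 10 output rows
```

In this toy run, 1000 pairs match on uid but only 10 survive the filter, which mirrors the gap between the uid-only matches and the final result described above.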

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

lgbo-ustc commented 3 months ago

There will be several PRs to solve this problem.