[X] I had searched in the issues and found no similar issues.
Description
In multi-table join scenario, result of a join will be used as the input of the subsequent joins.
During the interval between updating statistics, when we update data, statistics is not collected in time, If we run a multi-table join query within this time interval, the optimizer may choose a poor strategy because there is no correct statistics.
for example:
select t1.* from example_tbl t1
join example_tbl02 t2 on t1.city=t2.city and t1.city="chengdu"
join example_tbl03 t3 on t1.city=t3.city
join example_tbl04 t4 on t1.city=t4.city
join example_tbl05 t5 on t1.city=t5.city;
### Doris-2.0.4:
this is plan(2.0.4):
in this case, example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu"; example_tbl, example_tbl03 don't have any data of city="chengdu".
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
we can simplify this plan:
actual execution situation:
braodcast-01: distribute 2 rows of data
broadcast-02: distribute 2 rows of data
broadcast-03: distribute 120 million (2 2 30million) rows of data
broadcast-04: ditribute 3600 trillion(2 2 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
### Doris-master:
same query'plan(with same statistics data) in master branch code:
example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu";
example_tbl, example_tbl03 don't have any data of city="chengdu"
we can simplify this plan:
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
this is actual execution situation:
braodcast-01: distribute 2 rows of data
broadcast-02: distribute 2 rows of data
broadcast-03: distribute 120 million (2 2 30million) rows of data
broadcast-04: ditribute 3600 trillion(2 2 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
Solution
I solved this problem, this is plan:
we can simplify this plan:
as you can see, the type of data's distribution has changed from broadcast to shuffle hash, which avoid OOM
there is a mis-understanding of physical plan.
join
|---->A
+--->B
here A is the left child, and B is the right child. The join children should be switched in the figures.
Search before asking
Description
In multi-table join scenario, result of a join will be used as the input of the subsequent joins. During the interval between updating statistics, when we update data, statistics is not collected in time, If we run a multi-table join query within this time interval, the optimizer may choose a poor strategy because there is no correct statistics.
for example: select t1.* from example_tbl t1 join example_tbl02 t2 on t1.city=t2.city and t1.city="chengdu" join example_tbl03 t3 on t1.city=t3.city join example_tbl04 t4 on t1.city=t4.city join example_tbl05 t5 on t1.city=t5.city;
### Doris-2.0.4:
this is plan(2.0.4):
in this case, example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu"; example_tbl, example_tbl03 don't have any data of city="chengdu". After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
we can simplify this plan:
actual execution situation: braodcast-01: distribute 2 rows of data broadcast-02: distribute 2 rows of data broadcast-03: distribute 120 million (2 2 30million) rows of data broadcast-04: ditribute 3600 trillion(2 2 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
### Doris-master:
same query'plan(with same statistics data) in master branch code: example_tbl02, example_tbl04, example_tbl05 each has 30 million rows data of city="chengdu"; example_tbl, example_tbl03 don't have any data of city="chengdu"
we can simplify this plan:
After doris triggered first statistics collection, we insert 2 rows data of city="chengdu" into example_tbl and example_tbl03 respectively, and run the query before the second time statistics collection, thus we obtained uncorrect statistics and choosed a poor plan in this query.
this is actual execution situation: braodcast-01: distribute 2 rows of data broadcast-02: distribute 2 rows of data broadcast-03: distribute 120 million (2 2 30million) rows of data broadcast-04: ditribute 3600 trillion(2 2 30 million * 30 million) rows of data
as you can see, broadcast-03 and broadcast-04 distribute huge data because of uncorrct statistics, which easy to cause OOM.
Solution
I solved this problem, this is plan:
we can simplify this plan:
as you can see, the type of data's distribution has changed from broadcast to shuffle hash, which avoid OOM
Are you willing to submit PR?
Code of Conduct