apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Add config to collect statistics, enable in TPC-H benchmark #796

Closed Dandandan closed 1 year ago

Dandandan commented 1 year ago

Which issue does this PR close?

Closes #797

Allows configuring the CollectLeft threshold in Ballista and enables collecting statistics in the TPC-H benchmark.

Queries 2, 5, 7, 8, 9, 10, 11, 14, 15, 16, and 21 are faster with statistics available because of better join selection.
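To illustrate why statistics matter for join selection, here is a minimal, self-contained sketch of a threshold-based choice between a CollectLeft and a partitioned hash join. It is not the Ballista or DataFusion implementation: `PlannerSettings`, `InputStats`, and `choose_join_mode` are hypothetical names made up for this example, and the real optimizer considers more than the left input's size.

```rust
/// Join strategies, named after the modes discussed in this PR.
#[derive(Debug, PartialEq)]
enum JoinMode {
    /// The left (build) side is collected into one partition and shared by all probe partitions.
    CollectLeft,
    /// Both sides are repartitioned on the join keys.
    Partitioned,
}

/// Hypothetical planner settings; the field names are illustrative, not real config keys.
struct PlannerSettings {
    collect_statistics: bool,
    collect_left_threshold_bytes: usize,
}

/// Size estimate for one join input; `None` means no statistics are available.
struct InputStats {
    total_byte_size: Option<usize>,
}

/// Pick CollectLeft only when statistics are enabled, the left input has a known
/// size, and that size is at or below the threshold. Otherwise fall back to a
/// partitioned join, which is the safe choice for inputs of unknown size.
fn choose_join_mode(settings: &PlannerSettings, left: &InputStats) -> JoinMode {
    if !settings.collect_statistics {
        return JoinMode::Partitioned;
    }
    match left.total_byte_size {
        Some(size) if size <= settings.collect_left_threshold_bytes => JoinMode::CollectLeft,
        _ => JoinMode::Partitioned,
    }
}
```

In this sketch, a planner without statistics can never pick CollectLeft, because it cannot tell whether the build side is small. That matches the pattern in the results: only some queries change once statistics are collected.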

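| Query | Without collecting stats | With collecting stats |
|-------|--------------------------|-----------------------|
| 1 | 1450.91 | 1451.59 |
| 2 | 1462.58 | 846.69 |
| 3 | 1146.47 | 1142.10 |
| 4 | 935.40 | 936.54 |
| 5 | 1767.56 | 1357.01 |
| 6 | 516.96 | 518.90 |
| 7 | 2114.79 | 1667.12 |
| 8 | 2503.08 | 1707.93 |
| 9 | 2175.66 | 1734.37 |
| 10 | 1403.69 | 1158.69 |
| 11 | 1115.91 | 765.67 |
| 12 | 1080.43 | 1080.57 |
| 13 | 1244.99 | 1243.37 |
| 14 | 938.96 | 519.41 |
| 15 | 1683.18 | 1511.17 |
| 16 | 1336.33 | 1151.75 |
| 17 | 1963.21 | 1963.03 |
| 18 | 2001.95 | 2002.84 |
| 19 | 1294.52 | 1259.65 |
| 20 | 1250.44 | 1248.19 |
| 21 | 1771.44 | 1555.07 |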
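For reading the table, the speedup of a query can be taken as the time without statistics divided by the time with statistics. The sketch below applies that to a few rows copied from the table above. The helper names `speedup` and `sample_speedups` are made up for this example.

```rust
/// Ratio of the time without statistics to the time with statistics.
/// Values above 1.0 mean the run with statistics was faster.
fn speedup(without_stats: f64, with_stats: f64) -> f64 {
    without_stats / with_stats
}

/// Returns (query number, speedup) pairs for a few rows of the benchmark table.
fn sample_speedups() -> Vec<(u32, f64)> {
    // (query, without collecting stats, with collecting stats)
    let rows = [
        (2, 1462.58, 846.69),
        (8, 2503.08, 1707.93),
        (14, 938.96, 519.41),
        (17, 1963.21, 1963.03),
    ];
    rows.iter()
        .map(|&(query, without, with)| (query, speedup(without, with)))
        .collect()
}
```

With these numbers, queries 2 and 14 come out at roughly 1.7x and 1.8x, while query 17 is essentially unchanged.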

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?