This is a derived work from https://github.com/hortonworks/hive-testbench
This benchmark requires Apache Kylin 2.0 or later. The goal at this stage is functionality rather than performance. Enabling all TPC-H queries is a new milestone of Apache Kylin's SQL capability.
Prepare TPCH test data
Follow the steps to setup test data in Hive.
For example, generate a minimal data with ScaleFactor=2
git clone https://github.com/hortonworks/hive-testbench.git
cd hive-testbench/
./tpch-build.sh
./tpch-setup.sh 2
Verify the TPCH tables existed in hive
echo "use tpch_flat_orc_2; show tables;" | hive
Import Kylin model
Get this repository.
git clone https://github.com/Kyligence/kylin-tpch.git
Make sure there is a running Kylin instance.
export KYLIN_HOME=...
Run the setup-kylin-model.sh
with the SAME SCALE FACTOR to generate hive data, for example:
cd kylin-tpch
./setup-kylin-model.sh 2
The script also creates a few simple views on top of the original TPCH tables to allow Kylin pre-calculate some complex measures. The resulted E-R model topology is identical to the original TPCH model.
Build Kylin cube
In the Kylin web, click "System->Reload Metadata" to refresh the newly imported TPCH model.
Select project "tpch" and find the 4 cubes below.
lineitem_cube
and partsupp_cube
together covers 20 TPCH queries out of the total 22. Besides, customer_vorder_cube
is for query 13 and customer_cube
is for query 22.
Build all cubes within the year range from 1992 to 1998.
Verify by counting v_lineitem
in Kylin. The row count must match the Hive table lineitem
.
select count(*) from v_lineitem
Run TPCH queries
query
folder. Run them via the Kylin web, or Kylin REST API / JDBC / ODBC.queries/original-queries
. The original and the adapted queries should give the same result in Hive and Kylin respectively.tools/query-tool.py
. For Kylin, it prints query duration. For Hive, it prints query duration and query results.
pip install requests
python tools/query-tool.py -s http://127.0.0.1:7070/kylin -u ADMIN -p KYLIN -d queries -o tpch -r 3 -t kylin
python tools/query-tool.py -t hive -c 10
This benchmark was tested in two environments. One enviroment is HDP sandbox, another is a 4-node CDH cluster. Each query was run multiple times and the final stable response time get recorded for both Kylin and Hive.
And again, the goal at this stage is SQL capability rather than performance. No special performance tuning was done to both Kylin and Hive.
Btw, KyBot is a great tool if you want to analyze Kylin cube and query performance. Upload a Kylin diagnosis pack to this online service and it can show you why a query runs slow or why a cube takes long to build. KyBot is free for trial.
Cube Build Stats | HDP Sandbox (SF=2) | CDH 4-Node Cluster (SF=10) |
---|---|---|
lineitem_cube | 20.1 GB, 125 mins * 7 | 85.8 GB, 54.69 mins * 7 |
partsupp_cube | 1.6 GB, 27.02 mins | 8.3 GB, 18.48 mins |
customer_vorder_cube | 283 MB, 16.88 mins | 1.39 GB, 13.7 mins |
customer_cube | 7 MB, 3.95 mins | 36.98 MB, 3.97 mins |
(The lineitem_cube was built incrementally. Each increment builds 1 years data.)
Query Response Time | Kylin on HDP (SF=2) | Hive+Tez on HDP (SF=2) | Kylin on CDH (SF=10) | Hive+MR on CDH (SF=10) |
---|---|---|---|---|
query01 | 0.43 sec | 13.79 sec | 0.45 sec | 43.45 sec |
query02 | 8.87 sec | 17.04 sec | 8.77 sec | 3570 sec |
query03 | 5.2 sec | 16.32 sec | 4.84 sec | 137.4 sec |
query04 | 0.71 sec | 31.23 sec | 2.69 sec | 90.99 sec |
query05 | 0.33 sec | 23.68 sec | 0.47 sec | 162.79 sec |
query06 | 0.34 sec | 8.91 sec | 0.39 sec | 23.29 sec |
query07 | 0.17 sec | 35.23 sec | 0.24 sec | 245.52 sec |
query08 | 7.42 sec | 19.4 sec | 8.82 sec | 2248.5 sec |
query09 | 10.4 sec | 30.75 sec | 34.8 sec | 19869 sec |
query10 | 21.1 sec | 19.03 sec | 11.91 sec | 120.61 sec |
query11 | 3.42 sec | 15.87 sec | 2.09 sec | 123.3 sec |
query12 | 7.66 sec | 12.64 sec | 5.73 sec | 96.34 sec |
query13 | 20.44 sec | 24.61 sec | 104.77 sec | 135.64 sec |
query14 | 1.1 sec | 4.54 sec | 1.56 sec | 54.42 sec |
query15 | 1.96 sec | 13.06 sec | 17.02 sec | 124.29 sec |
query16 | 3.69 sec | 17.36 sec | 10.16 sec | 144.34 sec |
query17 | 5.49 sec | 25.61 sec | 10.19 sec | 183.56 sec |
query18 | 22.43 sec | 65.32 sec | 74.88 sec | 231.9 sec |
query19 | 21.9 sec | 22.5 sec | 138.81 sec | 186.87 sec |
query20 | 6.46 sec | 16.21 sec | 42.02 sec | 229.01 sec |
query21 | 58.6 sec | 84.6 sec | 136.3 sec | 276.59 sec |
query22 | 23.21 sec | 21.79 sec | 102.81 sec | 154.48 sec |
HDP Sandbox Configuration
CDH 4-Node Cluster Configuration