Kyligence / ClickHouse

ClickHouse® is a free analytics DBMS for big data
https://clickhouse.com
Apache License 2.0
14 stars 17 forks source link

Optimize array element function when input is array(map)/array(array(string))/array(array(number)) #466

Closed taiyang-li closed 10 months ago

taiyang-li commented 10 months ago

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Opt array element when input is array(map)

优化原理:对Array(Map(K, V))执行arrayElement时,现有实现无法提前reserve,导致大量realloc和free操作,影响执行效率,可能导致OOM

arrayElement(Array(Map(K, V))) -> arrayElement(Array(Array(Tuple(K, V)))) -> arrayElement(Array(Tuple(Array(K), Array(V)))) -> arrayElement(Array(Array(K))) + ArrayElement(Array(Array(V)))

arrayElement(Array(Array(K))ArrayElement(Array(Array(V)))中的K和V为String/Number类型时,可提前确定ColumnString和ColumnVector的大小并分配内存

kyligence-git commented 10 months ago

Can one of the admins verify this patch?

taiyang-li commented 10 months ago
优化输入类型为Array(Array(T))
select materialize(array(array(1,2,3,4)))[1] from numbers(10000000) format Null;
优化后
0 rows in set. Elapsed: 0.464 sec. Processed 10.01 million rows, 80.06 MB (21.55 million rows/s., 172.41 MB/s.)
Peak memory usage: 4.25 MiB.
0 rows in set. Elapsed: 0.455 sec. Processed 10.01 million rows, 80.06 MB (21.99 million rows/s., 175.93 MB/s.)
Peak memory usage: 4.25 MiB.
0 rows in set. Elapsed: 0.470 sec. Processed 10.01 million rows, 80.06 MB (21.28 million rows/s., 170.22 MB/s.)
Peak memory usage: 4.25 MiB

优化前
0 rows in set. Elapsed: 0.551 sec. Processed 10.01 million rows, 80.06 MB (18.18 million rows/s., 145.41 MB/s.)
Peak memory usage: 4.26 MiB.
0 rows in set. Elapsed: 0.528 sec. Processed 10.01 million rows, 80.06 MB (18.95 million rows/s., 151.63 MB/s.)
Peak memory usage: 4.26 MiB.
0 rows in set. Elapsed: 0.566 sec. Processed 10.01 million rows, 80.06 MB (17.67 million rows/s., 141.35 MB/s.)
Peak memory usage: 4.25 MiB.
taiyang-li commented 10 months ago
优化输入类型为Array(Array(String))
select materialize(array(array('hello', 'world')))[1] from numbers(10000000)  format Null 

优化后
0 rows in set. Elapsed: 0.722 sec. Processed 10.01 million rows, 80.06 MB (13.87 million rows/s., 110.94 MB/s.)
Peak memory usage: 12.06 MiB.
0 rows in set. Elapsed: 0.712 sec. Processed 10.01 million rows, 80.06 MB (14.05 million rows/s., 112.37 MB/s.)
Peak memory usage: 12.06 MiB.
0 rows in set. Elapsed: 0.722 sec. Processed 10.01 million rows, 80.06 MB (13.86 million rows/s., 110.86 MB/s.)
Peak memory usage: 12.06 MiB.

优化前
0 rows in set. Elapsed: 0.824 sec. Processed 10.01 million rows, 80.06 MB (12.14 million rows/s., 97.16 MB/s.)
Peak memory usage: 12.07 MiB.
0 rows in set. Elapsed: 0.827 sec. Processed 10.01 million rows, 80.06 MB (12.10 million rows/s., 96.78 MB/s.)
Peak memory usage: 12.07 MiB.
0 rows in set. Elapsed: 0.842 sec. Processed 10.01 million rows, 80.06 MB (11.88 million rows/s., 95.06 MB/s.)
Peak memory usage: 12.07 MiB. 
taiyang-li commented 10 months ago
优化输入类型Map

select materialize(array(map('hello', 1, 'world', 2)))[1] from numbers(10000000) format Null; 
优化前
0 rows in set. Elapsed: 6.504 sec. Processed 10.01 million rows, 80.06 MB (1.54 million rows/s., 12.31 MB/s.)
Peak memory usage: 12.58 MiB.
0 rows in set. Elapsed: 6.394 sec. Processed 10.01 million rows, 80.06 MB (1.57 million rows/s., 12.52 MB/s.)
Peak memory usage: 12.58 MiB.
0 rows in set. Elapsed: 6.398 sec. Processed 10.01 million rows, 80.06 MB (1.56 million rows/s., 12.51 MB/s.)
Peak memory usage: 12.58 MiB.

优化后
0 rows in set. Elapsed: 1.029 sec. Processed 10.01 million rows, 80.06 MB (9.72 million rows/s., 77.79 MB/s.)
Peak memory usage: 12.95 MiB.
0 rows in set. Elapsed: 1.010 sec. Processed 10.01 million rows, 80.06 MB (9.91 million rows/s., 79.26 MB/s.)
Peak memory usage: 12.95 MiB.
0 rows in set. Elapsed: 1.032 sec. Processed 10.01 million rows, 80.06 MB (9.70 million rows/s., 77.58 MB/s.)
Peak memory usage: 12.95 MiB.
taiyang-li commented 10 months ago

生产环境的数据

CREATE TABLE test_array_element ENGINE = Memory AS select events from file('test_array_element.parquet', 'Parquet', 'events Nullable(Array(Nullable(Tuple(event_id Nullable(String), log_extra Nullable(Map(String, Nullable(String))), event_info Nullable(Map(String, Nullable(String)))))))') 

select events[1] from test_array_element format Null; 

优化前
0 rows in set. Elapsed: 5.438 sec. Processed 171.96 thousand rows, 5.29 GB (31.62 thousand rows/s., 972.62 MB/s.)
Peak memory usage: 6.42 GiB.
0 rows in set. Elapsed: 5.617 sec. Processed 171.96 thousand rows, 5.29 GB (30.61 thousand rows/s., 941.60 MB/s.)
Peak memory usage: 6.39 GiB.
0 rows in set. Elapsed: 6.208 sec. Processed 171.96 thousand rows, 5.29 GB (27.70 thousand rows/s., 851.97 MB/s.)
Peak memory usage: 6.39 GiB.

优化后
0 rows in set. Elapsed: 2.112 sec. Processed 171.96 thousand rows, 5.29 GB (81.42 thousand rows/s., 2.50 GB/s.)
Peak memory usage: 6.45 GiB.
0 rows in set. Elapsed: 1.864 sec. Processed 171.96 thousand rows, 5.29 GB (92.24 thousand rows/s., 2.84 GB/s.)
Peak memory usage: 6.45 GiB.
0 rows in set. Elapsed: 1.861 sec. Processed 171.96 thousand rows, 5.29 GB (92.41 thousand rows/s., 2.84 GB/s.)
Peak memory usage: 6.45 GiB.