Open l1t1 opened 11 months ago
a 1/10 size parquet has the same problem
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT avg(i) FROM file('/data/t2.parquet') group by round(log10(i)); " Pretty
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
โโโโโโโโโโโโโโ
โ avg(i) โ
โกโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโค
โ 65811388.5 โ
โโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโ
real 0m38.475s
user 0m37.700s
sys 0m17.630s
root@localhost:/data/data/com.termux/files/home# time ./ch239 --local -q "SELECT avg(i) FROM file('/data/t2.parquet') group by round(log10(i));"
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
Lowered mark cache size to 2.79 GiB because the system has limited RAM
2
17392527
17392.5
1739.5
173925
17.5
174
65811388.5
1739252.5
real 0m13.121s
user 0m24.100s
sys 0m6.240s
Thanks for the report @l1t1 we appreciate it! This definitely helps refine our scope. The next chdb versions should improve any gaps in performance, although clickhouse-local and chdb are built slightly different (jemalloc for one) so details about the execution context are extremely important. I cannot reproduce this performance gap on my system for instance.
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
This is quite odd. Could you tell us more about the running environment?
not only query file, query numbers_mt() has the same question. the max_threads values are the same.
root@localhost:/data/data/com.termux/files/home# time ./ck235 --local -q "SELECT avg(number) FROM numbers_mt(1,100000000) group by round(log10(number));"
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
2
17392527
17392.5
1739.5
173925
17.5
174
65811388.5
1739252.5
real 0m1.535s
user 0m7.530s
sys 0m0.200s
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,100000000) group by round(log10(number))" Pretty
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
โโโโโโโโโโโโโโโ
โ avg(number) โ
โกโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโค
โ 65811388.5 โ
โโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโ
real 0m3.695s
user 0m8.820s
sys 0m0.460s
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT getSetting('max_threads')" Pretty <jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ getSetting('max_threads') โ
โกโโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ
โ 8 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
real 0m2.337s
user 0m1.500s
sys 0m0.380s
root@localhost:/data/data/com.termux/files/home# time ./ck235 --local -q "SELECT getSetting('max_threads');" <jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
8
real 0m0.609s
user 0m0.390s
sys 0m0.110s
I run them on termux + proot-distro ubuntu 22.04 a phone has 8-core CPU and 6GB memory
Termux Variables:
TERMUX_APK_RELEASE=GITHUB
TERMUX_APP_PACKAGE_MANAGER=apt
TERMUX_APP_PID=3239
TERMUX_IS_DEBUGGABLE_BUILD=1
TERMUX_MAIN_PACKAGE_FORMAT=debian
TERMUX_VERSION=0.118.0
Packages CPU architecture:
aarch64
Subscribed repositories:
# sources.list
deb https://mirrors.tuna.tsinghua.edu.cn/termux/apt/termux-main stable main
# tur-repo (sources.list.d/tur.list)
deb https://tur.kcubeterm.com tur-packages tur tur-on-device tur-continuous
Updatable packages:
apt/stable 2.7.6 aarch64 [upgradable from: 2.7.3]
brotli/stable 1.1.0 aarch64 [upgradable from: 1.0.9-1]
command-not-found/stable 2.3.0-3 aarch64 [upgradable from: 2.3.0-1]
coreutils/stable 9.4 aarch64 [upgradable from: 9.3]
curl/stable 8.3.0 aarch64 [upgradable from: 8.2.1]
dos2unix/stable 7.5.1 aarch64 [upgradable from: 7.5.0]
libarchive/stable 3.7.2 aarch64 [upgradable from: 3.7.1]
libcurl/stable 8.3.0 aarch64 [upgradable from: 8.2.1]
libimagequant/stable 4.2.1 aarch64 [upgradable from: 4.2.0]
libsqlite/stable 3.43.1 aarch64 [upgradable from: 3.42.0]
libunbound/stable 1.18.0 aarch64 [upgradable from: 1.17.1-2]
libwebp/stable 1.3.2 aarch64 [upgradable from: 1.3.1-2]
openssl/stable 1:3.1.3 aarch64 [upgradable from: 1:3.1.2]
python-numpy/stable 1.26.0 aarch64 [upgradable from: 1.25.2]
python-pandas/tur-packages 2.1.1 aarch64 [upgradable from: 2.1.0]
python-pillow/stable 10.0.1 aarch64 [upgradable from: 10.0.0]
rust/stable 1.72.1 aarch64 [upgradable from: 1.70.0-1]
unbound/stable 1.18.0 aarch64 [upgradable from: 1.17.1-2]
vim-runtime/stable 9.0.1900 all [upgradable from: 9.0.1800]
vim/stable 9.0.1900 aarch64 [upgradable from: 9.0.1800]
termux-tools version:
1.39.0
Android version:
8.1.0
Kernel build information:
Linux localhost 4.4.78-perf-g6520db4 #1 SMP PREEMPT Sat Dec 8 00:50:29 CST 2018 aarch64 Android
Device manufacturer:
Xiaomi
Device model:
Mi Note 3
LD Variables:
LD_LIBRARY_PATH=
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so
another phone didn't show jmalloc message, has the same question too
root@localhost:/data/data/com.termux/files/home# time ./clickhouse --local -q "SELECT avg(number) FROM numbers_mt(1,100000000) group by round(log10(number));"
2
17392527
17392.5
1739.5
173925
17.5
174
65811388.5
1739252.5
real 0m0.617s
user 0m3.563s
sys 0m0.145s
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,100000000) group by round(log10(number))" Pretty
โโโโโโโโโโโโโโโ
โ avg(number) โ
โกโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโค
โ 65811388.5 โ
โโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโ
real 0m1.688s
user 0m3.928s
sys 0m0.208s
the cpu info
~ $ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: r1p0
CPU(s) scaling MHz: 72%
CPU max MHz: 1863.0000
CPU min MHz: 554.0000
BogoMIPS: 3.84
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrc
pc dcpop asimddp
Model name: -
Model: 0
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
Stepping: 0x3
CPU(s) scaling MHz: 65%
CPU max MHz: 2861.0000
CPU min MHz: 826.0000
BogoMIPS: 3.84
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrc
pc dcpop asimddp
maybe it only occurs on a slow CPU, at faster phone, when increasing the data size, chdb runs as fast as clickhouse local.
root@localhost:/data/data/com.termux/files/home# time ./clickhouse --local -q "SELECT avg(number) FROM numbers_mt(1,1000000000) group by round(log10(number));"
2
658113883.5
17392527
17392.5
1739.5
173925
17.5
174
173925271.5
1739252.5
real 0m4.716s
user 0m35.153s
sys 0m0.238s
root@localhost:/data/data/com.termux/files/home# time ./clickhouse --local -q "SELECT avg(number) FROM numbers_mt(1,2000000000) group by round(log10(number));"
2
1158113883.5
17392527
17392.5
1739.5
173925
17.5
174
173925271.5
1739252.5
real 0m9.257s
user 1m10.508s
sys 0m0.407s
root@localhost:~# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,1000000000) group by round(log10(number))" Pretty
โโโโโโโโโโโโโโโ
โ avg(number) โ
โกโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโค
โ 658113883.5 โ
โโโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโค
โ 173925271.5 โ
โโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโ
real 0m5.027s
user 0m33.919s
sys 0m0.405s
root@localhost:~# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,2000000000) group by round(log10(number))" Pretty
โโโโโโโโโโโโโโโโ
โ avg(number) โ
โกโโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโโค
โ 1158113883.5 โ
โโโโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโโค
โ 173925271.5 โ
โโโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโโ
real 0m9.718s
user 1m8.479s
sys 0m0.670s
Thanks for the @l1t1 this is interesting and most likely means the performance issue is related to jemalloc use in chdb or even with build options for certain processors or instruction sets. Could we compare the two CPU features by any chance?
the slow CPU mentions is Qualcomm Snapdragon 660 Processor
~ $ free
total used free shared buff/cache available
Mem: 5849420 2074308 424304 28024 3350808 3515508
Swap: 2621436 327476 2293960
~ $ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: Qualcomm
Model name: Kryo-V2
Model: 4
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0xa
CPU(s) scaling MHz: 76%
CPU max MHz: 1843.2000
CPU min MHz: 633.6000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
Model name: Falkor-V1/Kryo
Model: 2
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0xa
CPU(s) scaling MHz: 50%
CPU max MHz: 2208.0000
CPU min MHz: 1113.6000
BogoMIPS: 38.40
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
Caches (sum of all):
L1d: 384 KiB (8 instances)
L1i: 384 KiB (8 instances)
L2: 2 MiB (2 instances)
when increasing data size(not the parquet file size), the gap is smaller.
root@localhost:/data/data/com.termux/files/home# time ./ck235 --local -q "SELECT avg(number) FROM numbers_mt(1,400000000) group by round(log10(number));"
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
2
358113883.5
17392527
17392.5
1739.5
173925
17.5
174
173925271.5
1739252.5
real 0m4.258s
user 0m28.420s
sys 0m0.460s
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,400000000) group by round(log10(number))" Pretty
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
โโโโโโโโโโโโโโโ
โ avg(number) โ
โกโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโค
โ 358113883.5 โ
โโโโโโโโโโโโโโโค
โ 17392527 โ
โโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโค
โ 173925271.5 โ
โโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโ
real 0m6.457s
user 0m30.310s
sys 0m0.690s
root@localhost:/data/data/com.termux/files/home#
root@localhost:/data/data/com.termux/files/home# time python3 -m chdb "SELECT avg(number) FROM numbers_mt(1,1000000000) group by round(log10(number))" Pretty
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
โโโโโโโโโโโโโโโ โ avg(number) โ
โกโโโโโโโโโโโโโโฉ
โ 2 โ
โโโโโโโโโโโโโโโค
โ 658113883.5 โ
โโโโโโโโโโโโโโโค
โ 17392527 โ โโโโโโโโโโโโโโโค
โ 17392.5 โ
โโโโโโโโโโโโโโโค
โ 1739.5 โ
โโโโโโโโโโโโโโโค
โ 173925 โ
โโโโโโโโโโโโโโโค
โ 17.5 โ
โโโโโโโโโโโโโโโค
โ 174 โ
โโโโโโโโโโโโโโโค
โ 173925271.5 โ
โโโโโโโโโโโโโโโค
โ 1739252.5 โ
โโโโโโโโโโโโโโโ
real 0m12.387s
user 1m13.140s
sys 0m1.580s
root@localhost:/data/data/com.termux/files/home# time ./ck235 --local -q "SELECT avg(number) FROM numbers_mt(1,1000000000) group by round(log10(number));"
<jemalloc>: Number of CPUs detected is not deterministic. Per-CPU arena disabled.
2
658113883.5
17392527
17392.5
1739.5
173925
17.5
174
173925271.5
1739252.5
real 0m9.979s
user 1m11.570s
sys 0m0.590s
(you don't have to strictly follow this form)
Describe the situation SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i)); chdb costs 400s, clickhouse local costs 100s How to reproduce
CREATE TABLE
statements for all tables involvedselect number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet';
SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));
Expected performance What are your performance expectation, why do you think they are realistic? Has it been working faster in older ClickHouse releases? Is it working faster in some specific other system? I hope chdb runs as fast as clickhouse local. Additional context Add any other context about the problem here. btw select number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet'; chdb runs as fast as clickhouse local