Performance on c6a.metal is really good, but on c6a.4xlarge it is not, so I did some analysis. Note that the SQL queries on ClickBench are numbered starting from 0.
So a query below that sits on line N of the ClickBench query list is labeled Q.N-1.
Q.23
Q.23:
SELECT * FROM file("hits_*.parquet", Parquet) WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
Result:
To test my guess, I set LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 before running Q.23, so that tcmalloc replaces the default memory allocator.
But why doesn't chdb link jemalloc? It's a problem I haven't dug into deeply enough yet:
When I simply link jemalloc into _chdb.cpython-xxxxx.so, importing the module fails with:
ImportError: /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so: cannot allocate memory in static TLS block
Since the performance impact is so large, I should solve this. Perhaps I can just follow https://github.com/jemalloc/jemalloc/issues/1237
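To repeat the LD_PRELOAD experiment from a Python benchmark harness, one option is to launch each run in a child process with the allocator library prepended to LD_PRELOAD, because the dynamic loader only reads that variable at process start. This is a minimal sketch under that assumption: `env_with_preload` and `run_with_allocator` are hypothetical helpers (not part of chdb), and the tcmalloc path is the one used above, which may differ on other distributions.

```python
import os
import subprocess
import sys


def env_with_preload(lib_path, base_env=None):
    """Hypothetical helper: copy base_env (default: os.environ) and
    prepend lib_path to LD_PRELOAD without mutating the original."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts a colon- or space-separated list; putting the
    # allocator first makes it load before any other preloaded object.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_with_allocator(script, lib_path):
    """Run a Python snippet in a fresh interpreter with lib_path preloaded.

    Setting os.environ inside the current process would be too late,
    since LD_PRELOAD is only honored when a process starts.
    """
    return subprocess.run(
        [sys.executable, "-c", script],
        env=env_with_preload(lib_path),
        check=True,
    )
```

For example, `run_with_allocator(bench_script, "/usr/lib64/libtcmalloc.so.4")`, where `bench_script` is a hypothetical string holding your own Q.23 benchmark code, runs that code under tcmalloc while the parent process keeps its default allocator.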
Q.28
Q.28:
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM file("hits_*.parquet", Parquet) WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
Result:
Q.28 is mainly a regex performance problem: chdb uses all 16 cores of the c6a.4xlarge. The run times of chdb and its cousin clickhouse-local are nearly identical, but DuckDB runs this query much faster. I think there are two possible explanations:
1. A re2 library version or optimization issue.
2. As we know, the ClickHouse engine doesn't use the min/max statistics stored in each Parquet file, so REGEXP_REPLACE may be evaluated on far more rows than necessary.
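To make the regex work in Q.28 concrete, here is a pure-Python sketch of what the query computes per row and per group. It uses Python's `re` module instead of re2 (which ClickHouse uses), so it illustrates the semantics only, not the performance; `referer_key` and `q28_like` are hypothetical names, and `min_count` stands in for the query's 100000 threshold.

```python
import re
from collections import defaultdict

# Same pattern as Q.28: capture the host, dropping an optional "www." prefix.
# Python's re stands in for re2 here; only the semantics are illustrated.
DOMAIN_RE = re.compile(r"^https?://(?:www\.)?([^/]+)/.*$")


def referer_key(referer):
    """Hypothetical helper mirroring REGEXP_REPLACE(Referer, pattern, '\\1').

    The pattern is anchored, so either the whole string is replaced by the
    captured host, or (no match) the string is returned unchanged.
    """
    return DOMAIN_RE.sub(r"\1", referer)


def q28_like(referers, min_count, limit=25):
    """Mimic Q.28's filter, grouping, aggregates, HAVING, ORDER BY and LIMIT."""
    groups = defaultdict(list)
    for r in referers:
        if r != "":  # WHERE Referer <> ''
            groups[referer_key(r)].append(r)

    rows = []
    for k, refs in groups.items():
        if len(refs) > min_count:  # HAVING COUNT(*) > min_count
            # len() counts characters; ClickHouse length() counts bytes.
            # They agree for ASCII URLs.
            avg_len = sum(len(r) for r in refs) / len(refs)
            rows.append((k, avg_len, len(refs), min(refs)))

    rows.sort(key=lambda row: row[1], reverse=True)  # ORDER BY l DESC
    return rows[:limit]
```

For instance, `referer_key("https://www.example.com/a/b")` yields `"example.com"`, while a Referer without a path, such as `"https://example.org"`, does not match and comes back unchanged. The real query performs this substitution once for every non-empty Referer before grouping, which is consistent with the query being CPU-bound on regex evaluation.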
Tip: the ClickHouse Parquet file handling issue might be solved in v23.4: https://twitter.com/ClickHouseDB/status/1649085317000105985?s=20
I also expect the huge performance gap between the ClickHouse engine and DuckDB on Q.36, Q.37, Q.38 and Q.39 to be greatly reduced in v23.4.
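Regarding the second explanation for Q.28: Parquet files can carry min/max statistics for each column chunk in every row group, and a reader that consults them can skip row groups that provably contain no matching rows. The sketch below shows that idea in pure Python; `RowGroupStats`, `can_skip_not_equal`, `can_skip_equal` and `rows_to_scan` are hypothetical and are not how ClickHouse or DuckDB implement it. Note that for Q.28's `Referer <> ''`, a row group can only be skipped when its min and max are both the empty string, so how much pruning would help depends on how empty Referer values are laid out across row groups.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one string column."""
    num_rows: int
    min_value: str
    max_value: str


def can_skip_not_equal(stats, value):
    """True if no row can satisfy `col <> value`.

    That is only provable when every value in the group equals `value`,
    i.e. min == max == value.
    """
    return stats.min_value == value and stats.max_value == value


def can_skip_equal(stats, value):
    """True if no row can satisfy `col = value` (value outside [min, max]).

    Python compares str by code point; for ASCII this matches the
    byte-wise ordering Parquet uses for string statistics.
    """
    return value < stats.min_value or value > stats.max_value


def rows_to_scan(row_groups, can_skip):
    """Number of rows a reader still has to decode after pruning."""
    return sum(rg.num_rows for rg in row_groups if not can_skip(rg))
```

With `rows_to_scan(groups, lambda rg: can_skip_not_equal(rg, ""))`, only row groups made up entirely of empty strings drop out; every other group still has to be decoded and fed through REGEXP_REPLACE. An equality or range predicate, as in `can_skip_equal`, usually gives statistics far more room to prune.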
The screenshot above is from ClickBench.
As expected, LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 didn't improve Q.28 much.
Here are the raw test results of chdb on c6a.4xlarge: