chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

Clickhouse-local and chdb performance issue on clickbench Q.23 Q28 #19

Closed auxten closed 6 months ago

auxten commented 1 year ago

Performance on c6a.metal is really good, but on c6a.4xlarge is not good, I did some analysis. SQLs on clickbench is start from 0.

So queries below on line N is Q.N-1 on clickbench.

Q.23

Q.23: SELECT * FROM file("hits_*.parquet", Parquet) WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10; got:

chdb: [56.04940148999992,51.57905327000003,47.26720601400007]
duckdb: [55.858323720000044,11.130785724999896,11.137916557000153]
clickhouse-local: [56.544, 17.397, 17.414]

To prove my guess, I did the LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 before running Q.23

image

But why chdb don't link jemalloc? it's a problem I didn't dig deep enough:

When just link jemalloc into the _chdb.cpython-xxxxx.so import will got: ImportError: /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so: cannot allocate memory in static TLS block

As the performance impact is so much, I should solve this. Maybe just follow https://github.com/jemalloc/jemalloc/issues/1237

Q.28

Q.28: SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM file("hits_*.parquet", Parquet) WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; got:

chdb: [10.385669515000018,10.016407472000083,10.284191555000007]
duckdb: [8.373684226000023,4.492152146999842,4.521108621999929]
clickhouse-local: [9.724, 9.377, 9.490]

Q.28 is mainly a regex performance problem, chdb eats all the 16 cores of c6a.4x. Time consumption of chdb and its cousin clickhouse-local are quite identical. But Duckdb run this really fast. I think this might be 2 explains:

  1. re2 lib version or optimization issue.
  2. As we know clickhouse engine didn't use the min, max data in every parquet file. This might get too much lines REGEXP_REPLACE.

Tips: The clickhouse parquet file handling issue might be solved in v23.4 https://twitter.com/ClickHouseDB/status/1649085317000105985?s=20

I also expect that the huge performance gap between clickhouse engine and duckdb on Q36, Q37, Q38, Q39 will also be greatly improved in v23.4

image

Screenshot above is from clickbench

As we can expect the LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 didn't improve Q28 too much.

image

Here is the raw test result of chdb on c6a.4xlarge:

  1. [0.4212477300000046,0.05675365600001214,0.05619672399996034],
  2. [0.4166838250000069,0.06531620699996665,0.06449468199997455],
  3. [1.321115270000007,0.11996398300004785,0.11983177199999773],
  4. [1.0048882890000073,0.19388485299998592,0.19680832800003145],
  5. [1.9418000499999835,1.3102912339999762,1.306730918000028],
  6. [1.7881873679999671,0.9799579250000079,0.9898529660000008],
  7. [0.4436242609999681,0.06424746199996889,0.06408176899998352],
  8. [0.4105360549999659,0.07395825700001524,0.07268457200001421],
  9. [1.4529841990000136,0.7407924660000162,0.7251662129999659],
  10. [3.591443298999991,0.908300875000009,0.8936401500000102],
  11. [2.30893248000001,0.3970227440000258,0.4045306449999657],
  12. [2.3094307900000217,0.44978438499998674,0.46010506099997883],
  13. [1.766951419999998,0.8745181759999809,0.8729436739999983],
  14. [3.655738981000013,1.3014796009999827,1.3102514689999794],
  15. [1.7923423009999624,0.9957754339999951,0.9786744300000123],
  16. [1.4113377190000165,0.904729954000004,0.8983218939999915],
  17. [4.241506091000019,2.110677882999994,2.108407218000025],
  18. [3.475192112000002,1.3591108809999923,1.3430213159999767],
  19. [7.629405652000003,4.1000120279999805,4.094294319000028],
  20. [0.9980636189999927,0.1998158640000156,0.20184096600002022],
  21. [11.467297670999983,3.3498150409999994,3.340530508000029],
  22. [12.402329799999961,3.8429053930000237,3.8139943720000247],
  23. [23.43172926599999,7.04715183899998,7.0258233459999815],
  24. [56.04940148999992,51.57905327000003,47.26720601400007],
  25. [3.778993186999969,0.9129657269999143,0.8343545320000203],
  26. [1.5488553679999768,0.7785552640000333,0.7703242139999702],
  27. [3.7575681749999603,0.7797505359999377,0.7788957500000606],
  28. [11.195327173999999,3.9467314390000183,4.043763985999931],
  29. [10.385669515000018,10.016407472000083,10.284191555000007],
  30. [3.109853629999975,2.6357215879999103,2.592264249999971],
  31. [4.27900334200001,1.0691001030000962,1.0309490470000355],
  32. [8.493144112999971,1.4274674520000872,1.41434988900005],
  33. [9.647242985000048,6.089477478999925,5.9719970529999955],
  34. [11.494503093000048,5.471921920999989,5.2813955599999645],
  35. [11.458392625000101,5.404140325000071,5.264485821999983],
  36. [1.4544796239999869,1.1437831629999664,1.1075418030000037],
  37. [13.741583547000005,3.498664385999973,3.5927922500000022],
  38. [12.745886966999933,3.499401634000037,3.489023333999967],
  39. [13.625658656999917,3.5325630399999,3.5357377320000296],
  40. [21.39596359199993,5.756981011999983,5.797184558000026],
  41. [6.0100831690000405,0.6963497419999385,0.6951165990000163],
  42. [6.212279076000073,0.6570028610000236,0.6571199239999714],
  43. [4.017423710000003,0.5795366049999302,0.505997210999908]
auxten commented 1 year ago

ClickHouse v23.4 released https://github.com/ClickHouse/ClickHouse/releases/tag/v23.4.1.1943-stable