cloudnativecube / octopus


Integrating ClickHouse with Elasticsearch #93

Open mdianjun opened 3 years ago

mdianjun commented 3 years ago
godliness commented 3 years ago

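Cloud Database ClickHouse Secondary Indexes: Best Practices

• Multi-column composite indexes & expression indexes
• Function pushdown
• IN set clause pushdown
• Multi-value indexes & dictionary indexes
• High compression ratio: 1:1 vs. Lucene 8.7
• Vectorized index build: 4X vs. Lucene 8.7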

mdianjun commented 3 years ago

Lucene C++ library: https://github.com/luceneplusplus/LucenePlusPlus

Its API resembles that of the Java version of Lucene, so the Lucene Java API documentation is the only reference available: https://lucene.apache.org/core/documentation.html

mdianjun commented 3 years ago

Full-text search support:

Some resources on tantivy:

The Rust language:

godliness commented 3 years ago

Elasticsearch/Lucene internals: https://zhuanlan.zhihu.com/p/33671444

Cas-pian commented 3 years ago

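The test datasets used in the Aliyun article have not been released, and constructing a dataset that fits its SQL queries takes considerable effort. For now I will look at text-search benchmarks and then port them to ClickHouse (ck) for testing.

Benchmark results for tantivy-based ClickHouse versus Lucene: http://centos04:8080/

TL;DR: the test setup below follows the tantivy benchmark. Run make corpus to download the dataset (7.7G).

The raw data has two columns: a String id and a String text. The Tantivy table currently supports only two UInt64 ids and one String body, so a temporary table is used for the conversion: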

CREATE TABLE corpus_origin
(
    `id` String,
    `text` String
)
ENGINE = MergeTree()
ORDER BY id;

The Tantivy table:

CREATE TABLE corpus
(
    primary_id UInt64,
    secondary_id UInt64,
    body String
)
ENGINE = Tantivy('/var/lib/clickhouse/tantivy/corpus');

# load data (shell)
wc -l corpus.json
cat corpus.json | clickhouse-client -m -q "INSERT INTO corpus_origin FORMAT JSONEachRow"

-- verify the row count (SQL)
SELECT count() FROM corpus_origin;

-- transform the data and load it into the target table
INSERT INTO corpus SELECT cityHash64(id) AS primary_id, rand32(0) AS secondary_id, text AS body FROM corpus_origin;
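
To sanity-check the conversion on a few rows before running it over the full corpus, the same mapping can be sketched offline in Python. This is only an illustration under stated assumptions: the helper names to_uint64 and transform_line are hypothetical, and BLAKE2b with an 8-byte digest stands in for cityHash64, so the resulting ids will not match what ClickHouse computes.

```python
import hashlib
import json
import random


def to_uint64(text_id: str) -> int:
    """Map a string id to an unsigned 64-bit integer.

    Assumption: BLAKE2b (8-byte digest) is a stand-in for ClickHouse's
    cityHash64; the values differ from what ClickHouse would produce.
    """
    digest = hashlib.blake2b(text_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def transform_line(line: str, rng: random.Random) -> dict:
    """Convert one JSONEachRow line {id, text} into the three-column layout."""
    row = json.loads(line)
    return {
        "primary_id": to_uint64(row["id"]),
        "secondary_id": rng.getrandbits(32),  # plays the role of rand32(0)
        "body": row["text"],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sample rows in the same shape as corpus.json.
    sample = [
        '{"id": "doc-1", "text": "hello world"}',
        '{"id": "doc-2", "text": "full-text search"}',
    ]
    for line in sample:
        print(json.dumps(transform_line(line, rng)))
```

Hashing a string id down to 64 bits can in principle produce collisions, so primary_id is not guaranteed to be unique. For a benchmark corpus this is probably acceptable, but it is worth keeping in mind if the ids are later used to join back to corpus_origin.
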

Cas-pian commented 3 years ago

Full-text search implemented in Apache Pinot: https://medium.com/apache-pinot-developer-blog/text-analytics-on-apache-pinot-cbf5c45d282c

Cas-pian commented 3 years ago

Another multi-model database: https://github.com/arangodb/arangodb . Its full-text search implementation is fairly simple: https://github.com/arangodb/arangodb/issues/1796