Open mdianjun opened 3 years ago
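• Multi-column composite indexes & expression indexes • Function pushdown • IN-set clause pushdown • Multi-value indexes & dictionary indexes • High compression ratio: 1:1 vs. Lucene 8.7 • Vectorized index build: 4X vs. Lucene 8.7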
lucene c++ lib: https://github.com/luceneplusplus/LucenePlusPlus
Its API is similar to that of the Java version of Lucene, so the only reference available is the Lucene Java API documentation: https://lucene.apache.org/core/documentation.html
full-text search support:
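Some material on tantivy:
Rust language:
Elasticsearch/Lucene internals: https://zhuanlan.zhihu.com/p/33671444
The Aliyun article did not release any of its test datasets, and constructing a dataset that matches its SQL is laborious. So for now, look at text-search benchmarks and then port them to ClickHouse for testing.
Benchmark results for tantivy-based ClickHouse vs. Lucene: http://centos04:8080/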
TLDR:
Prepare the test below, following the tantivy benchmark:
make corpus
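Download the dataset (7.7G)
The raw data has two columns: a String ID and a String text. The tantivy table currently supports only two UInt64 ids and one String body, so the data is converted through a temporary table: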
CREATE TABLE corpus_origin
(
`id` String,
`text` String
) ENGINE = MergeTree()
ORDER BY id;
tantivy table:
CREATE TABLE corpus
(
primary_id UInt64,
secondary_id UInt64,
body String
)
ENGINE = Tantivy('/var/lib/clickhouse/tantivy/corpus')
# load data
wc -l corpus.json
cat corpus.json | clickhouse-client -m -q "INSERT INTO corpus_origin FORMAT JSONEachRow"
select count() from corpus_origin;
-- data transform and load into target table
INSERT INTO corpus SELECT cityHash64(id) as primary_id, rand32(0) as secondary_id, text as body from corpus_origin;
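The load steps above print a line count (`wc -l`) and a row count (`select count()`) that are meant to be compared by eye. A minimal sketch of automating that comparison follows. The function name `compare_counts` is hypothetical and not part of the original steps, and the sketch assumes `corpus.json` holds one JSON object per line with a trailing newline, so that `wc -l` equals the number of records.

```shell
#!/bin/sh
# Hypothetical helper (not in the original steps): compare the number of
# lines in a JSONEachRow file with a row count reported by ClickHouse.
# Assumes one JSON object per line and a trailing newline at end of file.
compare_counts() {
    file="$1"
    table_rows="$2"
    # wc -l may pad its output with spaces on some platforms; strip them.
    file_lines=$(wc -l < "$file" | tr -d ' ')
    if [ "$file_lines" -eq "$table_rows" ]; then
        echo "ok: $file_lines rows"
    else
        echo "mismatch: file=$file_lines table=$table_rows"
        return 1
    fi
}

# Usage (assumes clickhouse-client is on PATH and corpus_origin is loaded):
#   compare_counts corpus.json "$(clickhouse-client -q 'SELECT count() FROM corpus_origin')"
```

A mismatch would suggest that some lines were rejected or skipped during the insert, which is worth ruling out before transforming the data into the tantivy table.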
Another multi-model database: https://github.com/arangodb/arangodb . Its full-text search implementation is fairly simple: https://github.com/arangodb/arangodb/issues/1796