Integrating ClickHouse with Elasticsearch

mdianjun commented 3 years ago

godliness commented 3 years ago

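ClickHouse cloud database secondary indexes: best practices

• Multi-column composite indexes & expression indexes
• Function pushdown
• IN set clause pushdown
• Multi-value indexes & dictionary indexes
• High compression ratio: 1:1 vs lucene 8.7
• Vectorized index build: 4X vs lucene 8.7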

mdianjun commented 3 years ago

Lucene C++ library: https://github.com/luceneplusplus/LucenePlusPlus

Its API closely mirrors the Java Lucene API, so the Lucene Java API docs are the only reference to go by: https://lucene.apache.org/core/documentation.html

mdianjun commented 3 years ago

full-text search support:

Some resources on tantivy:

Benchmark comparing tantivy with other search engines: https://tantivy-search.github.io/bench/
tantivy-cli, a tool for creating indexes and searching: https://github.com/tantivy-search/tantivy-cli
The first part of this article shows how tantivy is used: https://jstrong.dev/posts/2020/building-a-site-search-with-tantivy/
API docs: https://tantivy-search.github.io/tantivy/tantivy/index.html
Benchmark project: https://github.com/tantivy-search/search-benchmark-game

Rust language:

Rust documentation: https://prev.rust-lang.org/zh-CN/documentation.html
Standard library: https://doc.rust-lang.org/std/alloc/index.html
Tutorial: https://www.runoob.com/rust/rust-tutorial.html
The Rust Programming Language (Chinese translation): https://kaisery.github.io/trpl-zh-cn/title-page.html

godliness commented 3 years ago

Elasticsearch / Lucene internals: https://zhuanlan.zhihu.com/p/33671444

Cas-pian commented 3 years ago

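The test datasets used in the aliyun article were never published, and building a dataset that fits its SQL is laborious, so for now I'll look at text-search benchmarks and then bring them into ck for testing.

Benchmark results for tantivy-based ck vs lucene: http://centos04:8080/

TLDR: the test setup below follows the tantivy benchmark: make corpus downloads the dataset (7.7G)

Note that the data source is hosted on Dropbox, so a proxy is needed to reach it.

The raw data has two columns: a String ID and a String text. The tantivy table currently supports only two UInt64 ids and one String body, so the data is converted through a temporary table: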

-- staging table for the raw corpus (both columns are strings)
CREATE TABLE corpus_origin
(
    `id` String,
    `text` String
)
ENGINE = MergeTree()
ORDER BY id;
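
Before loading a 7.7G file, it may be worth checking that every line has the shape the staging table expects. The following is a minimal sketch, not part of the original setup: it assumes the corpus is in JSONEachRow layout (one JSON object per line) and that both `id` and `text` should be JSON strings. The helper name `check_rows` is hypothetical.

```python
import json


def check_rows(lines):
    """Validate JSONEachRow lines against the assumed (id: str, text: str) shape.

    Returns (ok_count, problems), where problems is a list of
    (line_number, message) tuples. Blank lines are skipped.
    """
    ok = 0
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        # Assumption: both fields must be present and be JSON strings.
        bad = [k for k in ("id", "text") if not isinstance(row.get(k), str)]
        if bad:
            problems.append(
                (lineno, "missing or non-string field(s): " + ", ".join(bad))
            )
        else:
            ok += 1
    return ok, problems
```

To use it on the real file, pass an open file handle, for example `check_rows(open("corpus.json", encoding="utf-8"))`. The file is read line by line, so memory use stays small even for the full corpus.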

tantivy table:

CREATE TABLE corpus
(
    primary_id UInt64,
    secondary_id UInt64,
    body String
)
ENGINE = Tantivy('/var/lib/clickhouse/tantivy/corpus')

# load data
wc -l corpus.json
cat corpus.json | clickhouse-client -m -q "INSERT INTO corpus_origin FORMAT JSONEachRow"

select count() from corpus_origin;
-- data transform and load into target table
INSERT INTO corpus SELECT cityHash64(id) as primary_id, rand32(0) as secondary_id, text as body from corpus_origin;

Cas-pian commented 3 years ago

Full-text search as implemented in Apache Pinot: https://medium.com/apache-pinot-developer-blog/text-analytics-on-apache-pinot-cbf5c45d282c

Cas-pian commented 3 years ago

Another multi-model database: https://github.com/arangodb/arangodb ; its full-text search implementation is fairly simple: https://github.com/arangodb/arangodb/issues/1796

cloudnativecube / octopus

Integrating ClickHouse with Elasticsearch #93