datafuselabs / databend


Support fulltext index #3915

Closed wubx closed 4 months ago

wubx commented 2 years ago

Discussed in https://github.com/datafuselabs/databend/discussions/3899

Originally posted by **wubx** January 19, 2022

Why we need a fulltext index:

1. In APM, a lot of queries look like: `where url like "%/user/%"`
2. In game service systems, a lot of queries look like: `where action like "%win%"`
3. Simple fulltext search as a replacement for ES

ClickHouse supports fulltext indexing with:

```
ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Stores a Bloom filter that contains all ngrams from a block of data. Works only with datatypes: String, FixedString and Map. Can be used for optimization of EQUALS, LIKE and IN expressions.

n — ngram size
size_of_bloom_filter_in_bytes — Bloom filter size in bytes (you can use large values here, for example, 256 or 512, because it can be compressed well)
number_of_hash_functions — the number of hash functions used in the Bloom filter
random_seed — the seed for Bloom filter hash functions

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

The same as ngrambf_v1, but stores tokens instead of ngrams. Tokens are sequences separated by non-alphanumeric characters.
```
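For reference, a minimal sketch of how these two index types are declared and used in ClickHouse, applied to the example queries from points 1 and 2 above (table and column names are illustrative):

```sql
-- Hypothetical APM-style table.
CREATE TABLE apm_events
(
    `url`    String,
    `action` String,
    -- 3-grams, 1024-byte Bloom filter, 2 hash functions, seed 0
    INDEX url_ngram_idx url TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1,
    -- whole tokens, split on non-alphanumeric characters
    INDEX action_token_idx action TYPE tokenbf_v1(1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY url;

-- The ngram index can skip granules whose Bloom filter cannot contain
-- the 3-grams of '/user/'.
SELECT count() FROM apm_events WHERE url LIKE '%/user/%';

-- The token index prunes on whole tokens, so it helps when the search term
-- is a complete token rather than an arbitrary substring.
SELECT count() FROM apm_events WHERE hasToken(action, 'win');
```
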
b41sh commented 2 years ago

/assignme

BohuTANG commented 2 years ago

It seems we could use tantivy and store the full text in our fuse engine? https://github.com/quickwit-oss/tantivy

Example: https://github.com/quickwit-oss/tantivy/tree/main/examples

cc @b41sh

b41sh commented 2 years ago

> It seems we could use tantivy and store the full text in our fuse engine? https://github.com/quickwit-oss/tantivy
>
> Example: https://github.com/quickwit-oss/tantivy/tree/main/examples
>
> cc @b41sh

It's a good idea, we can try it.

sunisdown commented 1 year ago

@b41sh @BohuTANG Any plans for development?

wubx commented 1 year ago

https://github.com/zhihu/rucene

b41sh commented 1 year ago

> @b41sh @BohuTANG Any plans for development?

I will start working on this issue this week.

BohuTANG commented 1 year ago

Also cc @zhang2014

BohuTANG commented 1 year ago

Benchmark

Databend

Table:

```sql
create table liket(c STRING);

insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- repeat the self-insert to double the table until it holds 1073741824 rows
insert into liket select * from liket;
```

```
select count(*) from liket where c like '%kd%';
+------------+
| count()    |
+------------+
| 1073741824 |
+------------+
1 row in set (22.50 sec)
Read 1073741824 rows, 38.00 GiB in 22.392 sec., 47.95 million rows/sec., 1.70 GiB/sec.
```

ClickHouse with ngram index

Table:

```sql
CREATE TABLE liket
(
    `c` String,
    INDEX fulltext_index c TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY c;

insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- repeat the self-insert to double the table until it holds 1073741824 rows
insert into liket select * from liket;
```

```
thinkpad :) select count(*) from liket where c like '%kd%';

SELECT count(*)
FROM liket
WHERE c LIKE '%kd%'

Query id: dce045a0-175b-4b6d-9751-0702fb3cb115

┌────count()─┐
│ 1073741824 │
└────────────┘

1 rows in set. Elapsed: 4.904 sec. Processed 1.07 billion rows, 41.88 GB (218.97 million rows/s., 8.54 GB/s.)
```

cc @sundy-li @b41sh

sundy-li commented 1 year ago

How about this SQL: `select ignore(c) from liket;`

I think the bottleneck is mostly on IO read.

BohuTANG commented 1 year ago

@sundy-li

Both

`select ignore(c) from liket where c like '%kd%';`

and

`select ignore(c) from liket;`

hang there :/

This test is easy to run locally; I think there are some issues with the query.

sundy-li commented 1 year ago

Tested on my laptop; I think the bottleneck is mostly on IO read & regexp matching. ClickHouse:

```sql
CREATE TABLE liket_noindex (c String) ENGINE = MergeTree ORDER BY c;

insert into liket          select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648);  -- 30.81 MB/s
insert into liket_noindex  select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648);  -- 245.13 MB/s

-- Actually the fulltext index did not help this query, because it can't prune any block:
-- every row contains 'kd', so the ngram Bloom filter never rules a granule out.
select count(*) from liket where c like '%kd%';           -- 0.846 sec
select count(*) from liket_noindex where c like '%kd%';   -- 1.053 sec

select count(ignore(c)) from liket;           -- 0.434 sec
select count(ignore(c)) from liket_noindex;   -- 0.552 sec
```

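For contrast, a pattern whose 3-grams do not occur in the data is the case the ngram index is actually built for. A sketch only; these were not measured:

```sql
-- Every row above is the same string, so no granule can be ruled out for '%kd%'.
-- A pattern that is absent from the data should let the Bloom-filter index skip
-- every granule, while the no-index table still does a full scan.
select count(*) from liket where c like '%xyz%';          -- index can prune all granules
select count(*) from liket_noindex where c like '%xyz%';  -- full scan regardless
```
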
sundy-li commented 1 year ago

In-memory bench.

```
databend :) select sum( to_string(number) like '98%') from numbers(5000000000);

+-------------------------------------+
| sum((to_string(number) like '98%')) |
+-------------------------------------+
|                            11111111 |
+-------------------------------------+
1 row in set (6.52 sec)
Read 5000000000 rows, 37.25 GiB in 6.517 sec., 767.27 million rows/sec., 5.72 GiB/sec.

databend :) select sum( to_string(number) like '%98%') from numbers(5000000000);

+--------------------------------------+
| sum((to_string(number) like '%98%')) |
+--------------------------------------+
|                            389599750 |
+--------------------------------------+
1 row in set (17.32 sec)
Read 5000000000 rows, 37.25 GiB in 17.314 sec., 288.78 million rows/sec., 2.15 GiB/sec.
```

```
clickhouse :) select sum( toString(number) like '98%') from numbers_mt(5000000000);

SELECT sum(toString(number) LIKE '98%')
FROM numbers_mt(5000000000)

Query id: de75caf9-57b2-4e37-b258-489b76250241

┌─sum(like(toString(number), '98%'))─┐
│                           11111111 │
└────────────────────────────────────┘

1 rows in set. Elapsed: 7.602 sec. Processed 5.00 billion rows, 40.00 GB (657.74 million rows/s., 5.26 GB/s.)

clickhouse :) select sum( toString(number) like '%98%') from numbers_mt(5000000000);

┌─sum(like(toString(number), '%98%'))─┐
│                           389599750 │
└─────────────────────────────────────┘

1 rows in set. Elapsed: 6.529 sec. Processed 5.00 billion rows, 40.00 GB (765.77 million rows/s., 6.13 GB/s.)
```

BohuTANG commented 1 year ago

Good. Does the `like` function in Databend have room for improvement?

sundy-li commented 1 year ago

Yes, we need to profile it and investigate.

sunisdown commented 1 year ago

Do we have an RFC for the fulltext index? tantivy is a good choice and we can discuss the design.

b41sh commented 1 year ago

> Do we have an RFC for the fulltext index? tantivy is a good choice and we can discuss the design.

tantivy only supports storing data on the filesystem, so we cannot use it directly. Maybe we need to implement an inverted index ourselves to support full-text search.
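To make the idea concrete, an inverted index essentially maps each term (or ngram) to the blocks of the base table that contain it. A purely conceptual sketch, modeled as plain tables, and not a statement about Databend's eventual design (all names are hypothetical):

```sql
-- Conceptual model only: one posting per (term, block) pair.
create table posting_list (
    term     STRING,  -- token or ngram extracted from the indexed column
    block_id BIGINT   -- block of the base table that contains the term
);

-- A predicate like `action like '%win%'` would first look up candidate blocks ...
select distinct block_id from posting_list where term = 'win';
-- ... and only those blocks of the base table are then scanned and verified
-- against the original LIKE pattern.
```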

sundy-li commented 1 year ago

> Do we have an RFC for the fulltext index?

A simple RFC in Chinese.

Databend 全文索引.pdf

sunisdown commented 1 year ago

> Do we have an RFC for the fulltext index?
>
> A simple RFC in Chinese.
>
> Databend 全文索引.pdf

I prefer the second solution; the first solution looks like another quickwit.

ariesdevil commented 1 year ago

ClickHouse uses Quickwit to implement full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

sunisdown commented 1 year ago

> ClickHouse uses Quickwit to implement full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/

The implementation is interesting in that it uses an ID to join the data from the ClickHouse query to the data from the Quickwit query. I think it can be used with a small amount of data.
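A sketch of that join-by-ID pattern, assuming the application first asks Quickwit for the matching document IDs and then feeds them back into ClickHouse (table and column names are hypothetical):

```sql
-- Step 1 (outside ClickHouse): run the full-text query against Quickwit and
-- collect the matching document IDs, e.g. 1001, 1002, 1003.

-- Step 2: join those IDs back against the structured data in ClickHouse.
SELECT id, url, action
FROM events
WHERE id IN (1001, 1002, 1003);  -- IDs returned by the Quickwit search
```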

BohuTANG commented 1 year ago

Tracking at #9811

ariesdevil commented 1 year ago

> ClickHouse uses Quickwit to implement full-text search. clickhouse.com/docs/en/guides/developer/full-text-search

ClickHouse deleted this content and redirected the link to their new full-text indexing feature, lol.

BohuTANG commented 1 year ago

qdrant supports both a fulltext index and a vector index: https://qdrant.tech/documentation/indexing/

BohuTANG commented 4 months ago

Move to https://github.com/datafuselabs/databend/issues/14505