Closed wubx closed 8 months ago
/assignme
Seems we can use tanivy
, and store the full text in our fuse engine?
https://github.com/quickwit-oss/tantivy
Example: https://github.com/quickwit-oss/tantivy/tree/main/examples
cc @b41sh
Seems we can use
tanivy
, and store the full text in our fuse engine? https://github.com/quickwit-oss/tantivyExample: https://github.com/quickwit-oss/tantivy/tree/main/examples
cc @b41sh
It's a good idea, we can try it.
@b41sh @BohuTANG Any plans for development?
@b41sh @BohuTANG Any plans for development?
I will start working on this issue this week.
Also cc @zhang2014
Table:
create table liket(c STRING);
insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824 rows
insert into liket select * from liket;
select count(*) from liket where c like '%kd%';
+------------+
| count() |
+------------+
| 1073741824 |
+------------+
1 row in set (22.50 sec)
Read 1073741824 rows, 38.00 GiB in 22.392 sec., 47.95 million rows/sec., 1.70 GiB/sec.
Table:
CREATE TABLE liket
(
`c` String,
INDEX fulltext_index c TYPE ngrambf_v1(3, 1024, 2, 0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY c;
insert into liket values('abcdefdafdafsfkjdsalkdjfldsaka');
-- insert 1073741824 rows
insert into liket select * from liket;
thinkpad :) select count(*) from liket where c like '%kd%';
SELECT count(*)
FROM liket
WHERE c LIKE '%kd%'
Query id: dce045a0-175b-4b6d-9751-0702fb3cb115
┌────count()─┐
│ 1073741824 │
└────────────┘
1 rows in set. Elapsed: 4.904 sec. Processed 1.07 billion rows, 41.88 GB (218.97 million rows/s., 8.54 GB/s.)
cc @sundy-li @b41sh
How about this sql :
select ignore( c) from liket;
I think the bottleneck is mostly on IO read.
@sundy-li
select ignore(c) from liket where c like '%kd%';
or
select ignore( c) from liket;
have a hang there :/
This test is easy to run on your local, I think there are some issues with the query.
Test on my laptap, the bottleneck is mostly on IO read & regexp match I think. ClickHouse:
CREATE TABLE liket_noindex (c String) ENGINE = MergeTree ORDER BY c;
insert into liket select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648); -- 30.81 MB/s
insert into liket_noindex select 'abcdefdafdafsfkjdsalkdjfldsaka' from numbers_mt(247483648); -- 245.13 MB/s
-- Actually the full index did not help this query, because it can't prune any block.
select count(*) from liket where c like '%kd%'; -- 0.846 sec
select count(*) from liket_noindex where c like '%kd%'; -- 1.053 sec
select count(ignore(c)) from liket; -- 0.434 sec
select count(ignore(c)) from liket_noindex; -- 0.552 sec
In memory bench.
databend :) select sum( to_string(number) like '98%') from numbers(5000000000);
+-------------------------------------+
| sum((to_string(number) like '98%')) |
+-------------------------------------+
| 11111111 |
+-------------------------------------+
1 row in set (6.52 sec)
Read 5000000000 rows, 37.25 GiB in 6.517 sec., 767.27 million rows/sec., 5.72 GiB/sec.
databend :) select sum( to_string(number) like '%98%') from numbers(5000000000);
+--------------------------------------+
| sum((to_string(number) like '%98%')) |
+--------------------------------------+
| 389599750 |
+--------------------------------------+
1 row in set (17.32 sec)
Read 5000000000 rows, 37.25 GiB in 17.314 sec., 288.78 million rows/sec., 2.15 GiB/sec.
clickhouse :) select sum( toString(number) like '98%') from numbers_mt(5000000000);
SELECT sum(toString(number) LIKE '98%')
FROM numbers_mt(5000000000)
Query id: de75caf9-57b2-4e37-b258-489b76250241
┌─sum(like(toString(number), '98%'))─┐
│ 11111111 │
└────────────────────────────────────┘
1 rows in set. Elapsed: 7.602 sec. Processed 5.00 billion rows, 40.00 GB (657.74 million rows/s., 5.26 GB/s.)
clickhouse :) select sum( toString(number) like '%98%') from numbers_mt(5000000000);
┌─sum(like(toString(number), '%98%'))─┐
│ 389599750 │
└─────────────────────────────────────┘
1 rows in set. Elapsed: 6.529 sec. Processed 5.00 billion rows, 40.00 GB (765.77 million rows/s., 6.13 GB/s.)
Good. Does the like function in databend have some improvement room?
Yes, need profile investigation.
Do we have RFC for the fulltext index? tanivy
is a good choice and we can discuss about the design.
Do we have RFC for the fulltext index?
tanivy
is a good choice and we can discuss about the design.
tantivy
only supports fs
storage of data, we cannot use it directly.
Maybe we need to implement an inverted index ourselves to support full-text search.
Do we have RFC for the fulltext index?
A simple RFC in Chinese.
Prefer the second solution, the first solution looks like another quickwit.
ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/
The implementation is interesting in that it uses an ID to join the data from the Clickhouse query to the data from the Quickwit query. I think it can be used with a small amount of data.
ClickHouse uses quickwit implements full-text search. https://clickhouse.com/docs/en/guides/developer/full-text-search/
Tracking at #9811
ClickHouse uses quickwit implements full-text search. clickhouse.com/docs/en/guides/developer/full-text-search↳
ClickHouse delete this content and redirect this link to the new full-text-indexing feature, lol.
qdrant supports both fulltext index and vector index: https://qdrant.tech/documentation/indexing/
Discussed in https://github.com/datafuselabs/databend/discussions/3899