StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.69k stars 1.76k forks source link

Support fulltext search with inverted index #34019

Closed dujijun007 closed 10 months ago

dujijun007 commented 10 months ago

Feature request

Motivation Inverted index is a commonly used data structure in information retrieval systems, which maps the "word-document" relationship to the "document-word" relationship. In an inverted index, each unique word has an associated index pointing to a list of documents containing that word. This indexing method makes document search based on words highly efficient.

In OLAP engines, inverted indexes and information retrieval play a crucial role. OLAP engines are typically used for handling complex analytical queries, which often involve large amounts of data and require aggregation and filtering across multiple dimensions.

Inverted indexes can greatly improve the performance of these queries. For example, if you want to find all data rows containing a specific word, you can directly look up the inverted index instead of scanning the entire dataset. This can significantly optimize the query scan volume, especially when retrieving a small amount of data from massive datasets, resulting in greatly improved query performance.

Currently, several OLAP engines, such as Doris and ClickHouse, have their own implementation of inverted indexes.

Solution

We want to abstract an implementation framework for inverted indexes, and CLucene is the first specific implementation we are trying. flowchart (1)

When we use inverted index, we can use grammar below to manage, write and query index.

// create table with inverted index
CREATE TABLE table_name
(
  columns_difinition,
  INDEX idx_name1(column_name1) USING GIN [PROPERTIES("analyzer" = "english|chinese|none", ...)] [COMMENT 'your comment']
  INDEX idx_name2(column_name2) USING GIN [PROPERTIES("analyzer" = "english|chinese|none", ...)] [COMMENT 'your comment']
)

// create index for table
-- grammar 1
CREATE INDEX idx_name ON table_name(column_name) USING GIN [PROPERTIES("analyzer" = "english|chinese|none", ...)] [COMMENT 'your comment'];
-- grammar 2
ALTER TABLE table_name ADD INDEX idx_name(column_name) USING GIN [PROPERTIES("analyzer" = "english|chinese|none", ...)] [COMMENT 'your comment'];
-- grammar 3 
ALTER TABLE table_name MODIFY INDEX idx_name(column_name)  [PROPERTIES] [COMMENT 'your comment']

// drop index
-- grammar 1
DROP INDEX idx_name ON table_name;
-- grammar 2
ALTER TABLE table_name DROP INDEX idx_name;

// query match inverted index with basic binary comparison operators: =,!=,>,>=,<,<=
select * from table_name where column='xxx'; // match

// furthermore, we can also use the following functions to have complete full-text search ability.
// terms query
select * from table where match_terms(f, ['n1', 'n2']);

// phrase query
select * from table where match_phrase(f1, "n1", "n2", 2);

// wildcard query
select * from table where match_wildcard(f1, "*15*");

// fuzzy query
select * from table where match_fuzzy(f1, "thinkbad", 1);

// range query
select * from table where match_range(f1, 1, 10, false, true);

// boost for query
select * from table where boost(terms_query(name, ['n1']), 10) and [other conditions];
imay commented 10 months ago

@dujijun007 How about naming this index as GIN like postgresql and cockroach?

dujijun007 commented 10 months ago

@dujijun007 How about naming this index as GIN like postgresql and cockroach?

@imay Indeed, GIN is more familiar to developers. I will rename it.