Closed b41sh closed 1 month ago
pr-16385-2023755-1725425088
note: this image tag is only available for internal use, please check the internal doc for more details.
This PR mainly optimizes the query performance of inverted index for String type.
How much has the query performance improved for the inverted index on String type in this PR?
No tests yet, I will add performance test results later.
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
This PR introduces a new search function to replace the `tantivy` index searcher. An inverted index query now follows the process below:

1. Read the `fst` (Finite State Transducer) first and check whether the term in the query matches; return immediately if it does not.
2. Read the `term dict` to get the `postings_range` in `idx` and the `positions_range` in `pos` for each term.
3. Read the `doc_ids` and `term_freqs` in `idx` for each term using `postings_range`.
4. Read the `position` of each term in `pos` using `positions_range`.

If the term does not match, only the `fst` file needs to be read. Since this is the case for most blocks, and the `fst` is usually only one-tenth the size of the entire index data, this greatly speeds up queries.

If the term matches, only the `idx` and `pos` data of the related terms need to be read, rather than all of the `idx` and `pos` data. This data is small enough to be cached entirely in memory, which speeds up subsequent queries.

This PR mainly optimizes the query performance of the inverted index for the String type. Score calculation and searching the JSON type are not implemented in this PR, so the relevant tests have been temporarily modified; those functions will be implemented in following PRs.
The inverted index data is stored in a new file format that splits the data by column, so that only the required fields need to be read. The schema information is also stored in the footer for future expansion. Data in the previous format can still be read and continues to work.
I created a table `pmc100` in my local environment and ran some SQL queries for testing on the old version and the new version.
We can see that the execution time has been greatly reduced:

- 0.535 sec -> 0.142 sec 'name:Crystallogr'
- 1.049 sec -> 0.038 sec 'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"'
- 0.374 sec -> 0.170 sec 'body:Benzaldehydehydrazone'
- 0.327 sec -> 0.047 sec 'body:Benzaldehydehydrazone Hadjoudis'
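For a rough sense of scale, the timings above can be turned into speedup factors. The snippet only divides the reported numbers; the variable names are illustrative, and single local runs like these should not be read as a benchmark.

```python
# (old seconds, new seconds) per query, copied from the timings above.
timings = {
    "name:Crystallogr": (0.535, 0.142),
    'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"': (1.049, 0.038),
    "body:Benzaldehydehydrazone": (0.374, 0.170),
    "body:Benzaldehydehydrazone Hadjoudis": (0.327, 0.047),
}

# Speedup factor = old time / new time.
speedups = {query: old / new for query, (old, new) in timings.items()}

for query, factor in speedups.items():
    print(f"{factor:5.1f}x  {query}")
```

On these numbers the improvement ranges from roughly 2x to roughly 28x, with the largest gain on the quoted `name` query.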
fixes: #[Link the issue here]
Tests
Type of change