refactor(storage): improve inverted index read fst file first to reduce load index

b41sh commented 2 months ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduce a new search function instead of tantivy index searcher, and follow the process below to perform the inverted index query search:

Read the fst(Finite State Transducer) first, check if the term in the query matches, return if it doesn't matched.
Read the term dict to get the postings_range in idx and the positions_range in pos for each terms.
Read the doc_ids and term_freqs in idx for each terms using postings_range.
If it's a phrase query, read the position of each terms in pos using positions_range.
Collect matched doc ids using term-related informations.

If the term does not match, only the fst file needs to be read. Since most of the blocks are this case, and the fst is usually only one-tenth the size of the entire index data, it can greatly speed up queries.

If the term matches, only the idx and pos data of the related terms need to be read instead of all the idx and pos data. The size of those datas are so small that they can all be cached in memory, which will speeding up following queries.

This PR mainly optimizes the query performance of inverted index for String type. The function of calculating the score and searching JSON type has not been implemented in this PR, so relevant tests have been temporarily modified, those functions will be implemented in the following PRs.

The inverted index data is stored in a new file format and split data by columns to facilitate reading related fields as required. The schema information is also stored in footer for future expansion. Previous data format reads are also compatible and can continue to be used.

┌─────┐ ┌──────┐ ┌─────┐ ┌─────┐ ┌───────────┐ ┌────────┐ ┌─────────┐ ┌────────────┐ ┌──────────┐
│ fst │ │ term │ │ idx │ │ pos │ │ fieldnorm │ │ schema │ │ offsets │ │ schema_len │ │ meta_len │
└─────┘ └──────┘ └─────┘ └─────┘ └───────────┘ └────────┘ └─────────┘ └────────────┘ └──────────┘
 \                                          /   \                                              /
  \                    ___________________ /     \             _______________________________/
   \                  /                           \           /
   index columns datas                             footer meta

create a table pmc100 on my local environment and run some sqls for tests.

CREATE TABLE pmc100 (
  name VARCHAR NULL,
  journal VARCHAR NULL,
  date VARCHAR NULL,
  volume VARCHAR NULL,
  issue VARCHAR NULL,
  accession VARCHAR NULL,
  timestamp TIMESTAMP NULL,
  pmid VARCHAR NULL,
  body VARCHAR NULL
);

CREATE INVERTED INDEX idx1 on pmc100(name, body);

COPY INTO pmc100 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);

MySQL [(none)]> select count(*) from pmc100;
+----------+
| COUNT(*) |
+----------+
|   574199 |
+----------+
1 row in set (0.026 sec)

old version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|    25135 |
+----------+
1 row in set (0.535 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (1.049 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.374 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.327 sec)

new version

MySQL [(none)]> select count(*) from pmc100 where query('name:Crystallogr');
+----------+
| COUNT(*) |
+----------+
|     3240 |
+----------+
1 row in set (0.142 sec)

MySQL [(none)]> select count(*) from pmc100 where query('name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"');
+----------+
| COUNT(*) |
+----------+
|       93 |
+----------+
1 row in set (0.038 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone');
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+
1 row in set (0.170 sec)

MySQL [(none)]> select count(*) from pmc100 where query('body:Benzaldehydehydrazone Hadjoudis');
+----------+
| COUNT(*) |
+----------+
|       82 |
+----------+
1 row in set (0.047 sec)

We can see that the execution time has been greatly reduced

0.535 sec -> 0.142 sec 'name:Crystallogr' 1.049 sec -> 0.038 sec 'name:"Acta_Crystallogr_D_Biol_Crystallogr_2014"' 0.374 sec -> 0.170 sec 'body:Benzaldehydehydrazone' 0.327 sec -> 0.047 sec 'body:Benzaldehydehydrazone Hadjoudis'

fixes: #[Link the issue here]

Tests

[ ] Unit Test
[x] Logic Test
[ ] Benchmark Test
[ ] No Test - Explain why

Type of change

[ ] Bug Fix (non-breaking change which fixes an issue)
[ ] New Feature (non-breaking change which adds functionality)
[ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected)
[ ] Documentation Update
[ ] Refactoring
[x] Performance Improvement
[ ] Other (please describe):

This change is

github-actions[bot] commented 2 months ago

Docker Image for PR

tag: pr-16385-2023755-1725425088

note: this image tag is only available for internal use, please check the internal doc for more details.

BohuTANG commented 1 month ago

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

b41sh commented 1 month ago

This PR mainly optimizes the query performance of inverted index for String type.

How much has the query performance improved for the inverted index on String type in this PR?

No tests yet, I will add performance test results later.

databendlabs / databend