datafuselabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.29k stars 701 forks source link

feat(query): inverted index use empty position data when query not contain phrase terms #15362

Closed b41sh closed 2 weeks ago

b41sh commented 2 weeks ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

In the files of the inverted index, the position file records the position of each term in the original text, which is used to judge whether the terms are adjacent to each other when searching for phrases. Position files are usually large in size, which leads to a long time of reading the data of the inverted index and affects the query speed. For example, the size of each file of a 55M inverted index data is as follows:

  1. positions file 37.8M
  2. postings file 10.7M
  3. terms file 6.3M
  4. store file 19.2K
  5. field_norms file 19.1K
  6. fast file 0.1K
  7. meta.json file 0.7K
  8. managed.json file 0.2K

We can see that the positions file takes up 68% of the total size of all the files, which is the main reason for the slow speed of reading the index data. Since the positions file is only used when querying for phrases, it is not used for other queries. We can choose not to read the positions file when the querying don't contain phrase terms, and use an empty positions file instead, which can greatly speed up the query. After testing, we found that the query time was reduced by about 50%.

for example

mysql> CREATE TABLE pmc20 (
    ->   name VARCHAR NULL,
    ->   journal VARCHAR NULL,
    ->   date VARCHAR NULL,
    ->   volume VARCHAR NULL,
    ->   issue VARCHAR NULL,
    ->   accession VARCHAR NULL,
    ->   timestamp TIMESTAMP NULL,
    ->   pmid VARCHAR NULL,
    ->   body VARCHAR NULL
    -> );
Query OK, 0 rows affected (0.10 sec)

mysql> create INVERTED INDEX idx1 on pmc20(name, journal, accession, body);
Query OK, 0 rows affected (0.06 sec)

mysql> COPY INTO pmc20 FROM 'fs:///data2/b41sh/bench/documents.json' FILE_FORMAT = (type = NDJSON);
+----------------------------------+-------------+-------------+-------------+------------------+
| File                             | Rows_loaded | Errors_seen | First_error | First_error_line |
+----------------------------------+-------------+-------------+-------------+------------------+
| data2/b41sh/bench/documents.json |      574199 |           0 | NULL        |             NULL |
+----------------------------------+-------------+-------------+-------------+------------------+
1 row in set (11 min 1.82 sec)
Read 1148398 rows, 43.02 GiB in 661.731 sec., 1.74 thousand rows/sec., 66.57 MiB/sec.

## old query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (2.53 sec)
Read 574199 rows, 0.00 B in 2.485 sec., 231.09 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (1.79 sec)
Read 574199 rows, 0.00 B in 1.706 sec., 336.6 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.69 sec)
Read 3039 rows, 0.00 B in 1.619 sec., 1.88 thousand rows/sec., 0.00 B/sec.

## new query
mysql> select count(*) from pmc20 where query('body:test');
+----------+
| count(*) |
+----------+
|   347849 |
+----------+
1 row in set (1.37 sec)
Read 574199 rows, 0.00 B in 1.301 sec., 441.49 thousand rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:glioblastoma');
+----------+
| count(*) |
+----------+
|    10955 |
+----------+
1 row in set (0.62 sec)
Read 574199 rows, 0.00 B in 0.542 sec., 1.06 million rows/sec., 0.00 B/sec.

mysql> select count(*) from pmc20 where query('body:"third canonical disulfide bridge"');
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (1.52 sec)
Read 3039 rows, 0.00 B in 1.435 sec., 2.12 thousand rows/sec., 0.00 B/sec.

part of #14825

Tests

Type of change


This change isโ€‚Reviewable

BohuTANG commented 2 weeks ago

Some panic during the tests :/