Optimizing DB queries on Node attributes and extras (JSONB fields)

CasperWA commented 3 years ago

Currently, DB queries on JSONB fields (Node attributes and extras) can be quite time consuming if there are many Nodes and if one has nested values. Extras are useful for storing metadata about the Node, and are as such used extensively by some to easily query the DB and retrieve relevant Nodes (mentioning @zhubonan as he raised this issue already on the slack channel). However, since the JSONB fields represent a single column value for a row in the DB table, each of the nested keys are not indexed resulting in slow query times.

This issue represents an investigation into how this may be improved, mainly conducted by @giovannipizzi.

Two solutions were investigated:

Creating a GIN (general inverted indexes) index for the extras fields one wishes to query with effeciency.
Creating a dedicated DB table to store the extras fields one wishes to query with effeciency.

Both present pros and cons.

1. GIN

Relevant reference material:

To create a GIN index one can do one of the following:

CREATE INDEX db_dbnode_extras_idxgin ON db_dbnode USING GIN (extras);

or

CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade' -> 'elements') jsonb_path_ops);

where Nodes have an extra: {"optimade": {"elements": ["C", "H"]}} or similar, i.e., a nested extra.

The first line will create a GIN index, which will be used when searching in the top-level of the JSON document (i.e., all extras keys). It will not be used when querying in nested values, like e.g.:

SELECT db_node_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode1.extras -> 'optimade' -> 'elements' ? 'C';

The second line will create a GIN index only usable for querying in the nested extras.optimade.elements values. I.e., the previous query would use this index. Furthermore, the second line uses the non-default operator class jsonb_path_ops, which allows one to only use the $> operator, i.e., the operators ?, ?&, and ?| do not work with indexes created using the jsonb_path_ops operator class. However, it will be faster than the default operator class when using the $> operator.

As a test, 2 indexes were created on a DB with more than 4 million StructureData Nodes, which all have a single extra "optimade", which consists of several keys and values. The indexes are:

"db_dbnode_expr_idx" gin (((extras -> 'optimade'::text) -> 'elements'::text) jsonb_path_ops)
"idx_test_gin_extras" gin (extras)

Some simple queries can now be tested:

# Uses index "dbnode_expr_idx", time: 0.7 s
SELECT db_dbnode_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode_1.extras -> 'optimade' -> 'elements' @> '["Si"]' LIMIT 10000;

# Does note use an index, time: 3.7 s
SELECT db_dbnode_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode_1.extras -> 'optimade' @> '{"elements": ["Si"]}' LIMIT 10000;

# Uses index "db_dbnode_expr_idx", time: 0.7 ms
SELECT db_dbnode_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode_1.extras @> '{"optimade": {"elements": ["Si"]}}' LIMIT 10000;

This shows a dramatic improvement when using the index, but also the strictness of how the query needs to be written in order for the index to be properly utilized.

When running the first two queries without a LIMIT and analyzing it, it becomes clear that the actual query takes ms, but most of the time is spent to do a recheck:

EXPLAIN ANALYZE
SELECT db_dbnode_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode_1.extras -> 'optimade' -> 'elements' @> '["Si"]'; 

 Bitmap Heap Scan on db_dbnode db_dbnode_1  (cost=150.92..16662.04 rows=4506 width=16) (actual time=107.680..24375.816 rows=398414 loops=1)
   Recheck Cond: (((extras -> 'optimade'::text) -> 'elements'::text) @> '["Si"]'::jsonb)
   Rows Removed by Index Recheck: 587097
   Heap Blocks: exact=39564 lossy=53314
   ->  Bitmap Index Scan on db_dbnode_expr_idx  (cost=0.00..149.79 rows=4506 width=0) (actual time=97.479..97.479 rows=398414 loops=1)
         Index Cond: (((extras -> 'optimade'::text) -> 'elements'::text) @> '["Si"]'::jsonb)
 Planning time: 0.209 ms
 Execution time: 24396.809 ms
(8 rows)

Total time: 24.4 s (Index search (Bitmap Index Scan-part): <0.001 ms + Recheck (Bitmap Heap Scan-part): 24.3 s). See the actual time parts.

The following shows that the index is not used when expressing the query differently:

EXPLAIN ANALYZE
SELECT db_dbnode_1.uuid FROM db_dbnode AS db_dbnode_1 WHERE db_dbnode_1.extras -> 'optimade' @> '{"elements": ["Si"]}'; 
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on db_dbnode db_dbnode_1  (cost=0.00..434932.12 rows=4506 width=16) (actual time=0.350..98127.709 rows=398414 loops=1)
   Filter: ((extras -> 'optimade'::text) @> '{"elements": ["Si"]}'::jsonb)
   Rows Removed by Filter: 4107394
 Planning time: 0.133 ms
 Execution time: 98151.131 ms
(5 rows)

Total time: 98.2 s

24 s is definitely not satisfactory. According to the reference StackOverflow QA the recheck is done when the working memory of the DB is exceeded and cannot contain the matching data and instead stores the pages in which they are found. A solution could thereby be to reduce the amount of data that needs to be loaded. Another solution could be to allocate more working memory. Reducing the amount of data can be done by either removing unnecessary data in the extras or, since the query will have to load every row, create a new DB table that only contains the extras JSONB column. This is the next to be investigated.

First, since the main issue is when all entries need to be investigated (e.g., when doing a count or retrieving columns/data from a query on all possible rows/entries) the following investigations are done with the assumption all data/rows/entries need to be checked. In other words, the query speed is not a major issue when a reasonable number is used for LIMIT (or OFFSET, technically, if set to sufficiently high number). This is also demonstrated above for the initial queries using LIMIT 10000.

2. New DB table

A new DB table is created with a single column containing the nested extras.optimade.elements, as well as Node PK and indexed using the following commands:

CREATE TABLE test AS (SELECT tab.id as node_id, tab.extras #> '{"optimade","elements"}' as optimade_elements FROM db_dbnode AS tab);
CREATE INDEX ON test USING GIN (optimade_elements jsonb_path_ops);

Resulting table overview:

                               Table "public.test"
      Column       |  Type   | Modifiers | Storage  | Stats target | Description 
-------------------+---------+-----------+----------+--------------+-------------
 node_id           | integer |           | plain    |              | 
 optimade_elements | jsonb   |           | extended |              |

# Uses index to count all entries
EXPLAIN ANALYZE
SELECT count(node_id) FROM test WHERE optimade_elements @> '["Si"]';

 Bitmap Heap Scan on test  (cost=138.92..12546.51 rows=4506 width=4) (actual time=81.409..380.755 rows=398414 loops=1)
   Recheck Cond: (optimade_elements @> '["Si"]'::jsonb)
   Heap Blocks: exact=19918
   ->  Bitmap Index Scan on test_optimade_elements_idx  (cost=0.00..137.79 rows=4506 width=0) (actual time=76.207..76.207 rows=398414 loops=1)
         Index Cond: (optimade_elements @> '["Si"]'::jsonb)
 Planning time: 0.150 ms
 Execution time: 396.707 ms

Total time: 0.4 s

HUGE improvement from the previous 24 s ! However, what if we now need other data (like UUID and another extra not stored in the special table, etc.) from the related Nodes. I.e., let's make a JOIN with the db_dbnode table.

# Uses index on `test` table, then `JOIN`s on the `db_dbnode` table.
# Retrieve UUID and the extra `optimade.chemical_formula_descriptive`, neither of which are stored in the `test` table.
EXPLAIN ANALYZE
SELECT node.uuid, node.extras #> '{"optimade","chemical_formula_descriptive"}' AS chem_formula
FROM db_dbnode AS node JOIN test AS test ON node.id = test.node_id WHERE optimade_elements @> '["Si"]'; 

 Nested Loop  (cost=139.34..49051.66 rows=4505 width=37) (actual time=79.700..11986.818 rows=398414 loops=1)
   ->  Bitmap Heap Scan on test  (cost=138.91..12544.10 rows=4505 width=4) (actual time=79.609..351.576 rows=398414 loops=1)
         Recheck Cond: (optimade_elements @> '["Si"]'::jsonb)
         Heap Blocks: exact=19918
         ->  Bitmap Index Scan on test_optimade_elements_idx  (cost=0.00..137.79 rows=4505 width=0) (actual time=74.497..74.497 rows=398414 loops=1)
               Index Cond: (optimade_elements @> '["Si"]'::jsonb)
   ->  Index Scan using db_dbnode_pkey on db_dbnode node  (cost=0.43..8.09 rows=1 width=41) (actual time=0.003..0.003 rows=1 loops=398414)
         Index Cond: (id = test.node_id)
 Planning time: 0.479 ms
 Execution time: 12006.540 ms
(10 rows)

Total time: 12 s

If we use LIMIT and OFFSET:

# Test retrieving entries that are not in the top of the table (setting `OFFSET` to 390,000).
# Retrieve only the first 20 entries (setting `LIMIT` to 20).
EXPLAIN ANALYZE
SELECT node.uuid AS uuid FROM db_dbnode AS node JOIN test AS test ON node.id = test.node_id WHERE optimade_elements @> '["Si"]' ORDER BY uuid OFFSET 390000 LIMIT 20; 

 Limit  (cost=49325.05..49325.05 rows=1 width=16) (actual time=1894.662..1894.667 rows=20 loops=1)
   ->  Sort  (cost=49313.79..49325.05 rows=4505 width=16) (actual time=1799.448..1884.580 rows=390020 loops=1)
         Sort Key: node.uuid
         Sort Method: external merge  Disk: 10104kB
         ->  Nested Loop  (cost=139.34..49040.40 rows=4505 width=16) (actual time=80.110..1423.637 rows=398414 loops=1)
               ->  Bitmap Heap Scan on test  (cost=138.91..12544.10 rows=4505 width=4) (actual time=80.091..323.782 rows=398414 loops=1)
                     Recheck Cond: (optimade_elements @> '["Si"]'::jsonb)
                     Heap Blocks: exact=19918
                     ->  Bitmap Index Scan on test_optimade_elements_idx  (cost=0.00..137.79 rows=4505 width=0) (actual time=74.994..74.994 rows=398414 loops=1)
                           Index Cond: (optimade_elements @> '["Si"]'::jsonb)
               ->  Index Scan using db_dbnode_pkey on db_dbnode node  (cost=0.43..8.09 rows=1 width=20) (actual time=0.002..0.003 rows=1 loops=398414)
                     Index Cond: (id = test.node_id)
 Planning time: 0.405 ms
 Execution time: 1896.501 ms
(14 rows)

Total time: 1.9 s

In summary, creating a different table hosting the extras one knows will bear the brunt of queries will improve query times. However, for a 4+ million Node DB the time is still not optimal (12 s). If, however, one only needs to retrieve a subset of the results, e.g., for a REST API where only a handful of entries are returned per page, this will work fine (1.9 s). Indeed, one doesn't even need to create a separate table. Instead, a well-written GIN index that indexes the extras that bear the brunt of queries will suffice (<1 s), and is actually to be preferred in this instance.

But it's also worth noting that the gain with the GIN index is only around 5x (from 3.7 s to 0.7 s). Since the index stores copies of all the values that it indexes, it is not advisable to create an index for large valued extras. And combined with the small time gain, it might not be worth it to create a GIN index, since the count (or LIMIT ALL) is still taking too long with the GIN index (24.4 s or 12 s if using count and a separate table).

Further testing is needed, and the final solution for a user may be very dependent on the particular use case.

zhubonan commented 3 years ago

This for looking into this.

The first line will create a GIN index, which will be used when searching in the top-level of the JSON document (i.e., all extras keys). It will not be used when querying in nested values, like e.g.:

How about creating an index like:

CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade'));

would this enable general indexed querying under the 'optimade' key?

My original concern was mainly related to the daily operations of the database. When the number of nodes gets large (mine has 400,000+ nodes and growing), queries for finding active/running processes can take some time (several seconds) using the verdi process list.

Also, my postgres only uses a small amount of memory (WSL1 Ubuntu 18.04). The verdi process list does seem to go faster the second time, maybe this is because some data have been loaded in the memory?

zhubonan commented 3 years ago

For well-defined routine operations such as finding nodes with certain attributes, e.g. verdi process list or finding certain composition in the extras, would it be useful to explore other index types such as btree and hash?

CasperWA commented 3 years ago

(...) would it be useful to explore other index types such as btree and hash?

As far as I remember, @giovannipizzi tried this as well (at least btree), but found that this was not optimized for the particular use case.

(...) How about creating an index like:
CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade'));
would this enable general indexed querying under the 'optimade' key?

This would not make all "embedded" values optimized for querying, as far as I know. Perhaps unless it's defined in a special way? I might misremember, but I think @giovannipizzi already tried this kind of GIN index, and it's indeed one of the possible solutions, but one must remember that the index also stores all the data again, hence if the specific extras are too large, it doesn't really bring about a large increase in performance.

My original concern was mainly related to the daily operations of the database. When the number of nodes gets large (mine has 400,000+ nodes and growing), queries for finding active/running processes can take some time (several seconds) using the verdi process list.

So you mean that simply because you're storing Extras, then verdi process list becomes slow. Or even without storing Extras?

Also, my postgres only uses a small amount of memory (WSL1 Ubuntu 18.04). The verdi process list does seem to go faster the second time, maybe this is because some data have been loaded in the memory?

Yeah, I guess there's some caching going on, but I would refer to @giovannipizzi's more expert knowledge on PostgreSQL in general.

zhubonan commented 3 years ago

I just want to add a follow-up. I recently migrated my WSL1 setup to WSL2 and now the speed for verdi process list seemed to improve a lot. If I do it twice the second time is a lot faster than the first. This is probably due to the improved file system performance and Linux page cache is enabled, also I am now running PosgreSQL12 on Ubuntu 20.04 rather than PostgreSQL 10 on Ubuntu 18.04.

PS: If I drop the page cache (to release memory to the windows side) then verdi process list gets slower again - so it was indeed benefited from that.

aiidateam / aiida-core

Optimizing DB queries on Node attributes and extras (JSONB fields) #4664

1. GIN

2. New DB table