Open CasperWA opened 3 years ago
Thanks for looking into this.
The first line will create a GIN index, which will be used when searching in the top-level of the JSON document (i.e., all extras keys). It will not be used when querying in nested values, like e.g.:
How about creating an index like:
CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade'));
would this enable general indexed querying under the 'optimade' key?
My original concern was mainly related to the daily operations of the database. When the number of nodes gets large (mine has 400,000+ nodes and growing), queries for finding active/running processes can take some time (several seconds) using the verdi process list.
Also, my postgres only uses a small amount of memory (WSL1 Ubuntu 18.04). The verdi process list does seem to go faster the second time; maybe this is because some data has been loaded into memory?
For well-defined routine operations such as finding nodes with certain attributes, e.g. verdi process list or finding a certain composition in the extras, would it be useful to explore other index types such as btree and hash?
(...) would it be useful to explore other index types such as btree and hash?
As far as I remember, @giovannipizzi tried this as well (at least btree), but found that this was not optimized for the particular use case.
(...) How about creating an index like:
CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade'));
would this enable general indexed querying under the 'optimade' key?
This would not make all "embedded" values optimized for querying, as far as I know. Perhaps unless it's defined in a special way? I might misremember, but I think @giovannipizzi already tried this kind of GIN index, and it's indeed one of the possible solutions, but one must remember that the index also stores all the data again, hence if the specific extras are too large, it doesn't really bring about a large increase in performance.
My original concern was mainly related to the daily operations of the database. When the number of nodes gets large (mine has 400,000+ nodes and growing), queries for finding active/running processes can take some time (several seconds) using the verdi process list.
So you mean that verdi process list becomes slow simply because you're storing Extras? Or is it slow even without storing Extras?
Also, my postgres only uses a small amount of memory (WSL1 Ubuntu 18.04). The verdi process list does seem to go faster the second time, maybe this is because some data have been loaded in the memory?
Yeah, I guess there's some caching going on, but I would refer to @giovannipizzi's more expert knowledge on PostgreSQL in general.
I just want to add a follow-up. I recently migrated my WSL1 setup to WSL2, and the speed of verdi process list seems to have improved a lot. If I run it twice, the second time is a lot faster than the first. This is probably due to the improved file system performance and the Linux page cache being enabled; I am also now running PostgreSQL 12 on Ubuntu 20.04 rather than PostgreSQL 10 on Ubuntu 18.04.
PS: If I drop the page cache (to release memory to the Windows side), verdi process list gets slower again, so it did indeed benefit from the cache.
Currently, DB queries on JSONB fields (Node attributes and extras) can be quite time-consuming if there are many Nodes and if one has nested values. Extras are useful for storing metadata about the Node, and are as such used extensively by some to easily query the DB and retrieve relevant Nodes (mentioning @zhubonan as he raised this issue already on the slack channel). However, since each JSONB field is a single column value for a row in the DB table, the nested keys are not individually indexed, resulting in slow query times.
This issue represents an investigation into how this may be improved, mainly conducted by @giovannipizzi.
Two solutions were investigated: (1) a GIN index and (2) a new DB table.
Both present pros and cons.
1. GIN
Relevant reference material:
To create a GIN index one can do one of the following:
or
where Nodes have an extra:
{"optimade": {"elements": ["C", "H"]}}
or similar, i.e., a nested extra.
The first line will create a GIN index, which will be used when searching in the top-level of the JSON document (i.e., all extras keys). It will not be used when querying in nested values, like e.g.:
The second line will create a GIN index only usable for querying in the nested extras.optimade.elements values. I.e., the previous query would use this index. Furthermore, the second line uses the non-default operator class jsonb_path_ops, which allows one to only use the @> operator, i.e., the operators ?, ?&, and ?| do not work with indexes created using the jsonb_path_ops operator class. However, it will be faster than the default operator class when using the @> operator.
As a test, 2 indexes were created on a DB with more than 4 million StructureData Nodes, which all have a single extra "optimade", which consists of several keys and values. The indexes are:
Some simple queries can now be tested:
This shows a dramatic improvement when using the index, but also the strictness of how the query needs to be written in order for the index to be properly utilized.
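To make that strictness concrete, here is a minimal sketch of an index on the nested value and two queries against it. The index name and the exact expressions are assumptions for illustration, not the commands used in the tests above; only the db_dbnode table, the extras column, and the optimade.elements extra come from this issue.

```sql
-- Hypothetical index on the nested array only, using the
-- non-default jsonb_path_ops operator class.
CREATE INDEX idx_gin_optimade_elements_path
    ON db_dbnode
    USING GIN ((extras -> 'optimade' -> 'elements') jsonb_path_ops);

-- Can use the index: the left-hand side repeats the indexed expression
-- exactly, and @> is an operator that jsonb_path_ops supports.
EXPLAIN ANALYZE
SELECT id
FROM db_dbnode
WHERE (extras -> 'optimade' -> 'elements') @> '["C", "H"]'::jsonb
LIMIT 10000;

-- Cannot use this index: the ? operator is not supported by
-- jsonb_path_ops, so the planner has to pick another plan
-- (typically a sequential scan).
EXPLAIN ANALYZE
SELECT id
FROM db_dbnode
WHERE (extras -> 'optimade' -> 'elements') ? 'C'
LIMIT 10000;
```

Comparing the two plans shows whether a Bitmap Index Scan on the index appears or not.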
When running the first two queries without a LIMIT and analyzing it, it becomes clear that the actual query takes ms, but most of the time is spent to do a recheck:
Total time: 24.4 s (Index search (Bitmap Index Scan-part): <0.001 ms + Recheck (Bitmap Heap Scan-part): 24.3 s). See the actual time parts.
The following shows that the index is not used when expressing the query differently:
Total time: 98.2 s
24 s is definitely not satisfactory. According to the reference StackOverflow QA the recheck is done when the working memory of the DB is exceeded and cannot contain the matching data and instead stores the pages in which they are found. A solution could thereby be to reduce the amount of data that needs to be loaded. Another solution could be to allocate more working memory. Reducing the amount of data can be done by either removing unnecessary data in the extras or, since the query will have to load every row, create a new DB table that only contains the extras JSONB column. This is the next to be investigated.
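One way to check whether the bitmap has become lossy, and whether more working memory helps, is sketched below. The 256MB value is an arbitrary assumption, and the query is a hypothetical containment query on extras.optimade.elements, not the one timed above.

```sql
-- Show the current per-operation working memory.
SHOW work_mem;

-- Raise it for this session only (arbitrary example value).
SET work_mem = '256MB';

-- In the Bitmap Heap Scan node, "Heap Blocks: exact=N lossy=M" reports
-- how many pages were tracked per row (exact) versus per page (lossy).
-- On a lossy page every row has to be rechecked against the condition.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM db_dbnode
WHERE (extras -> 'optimade' -> 'elements') @> '["C"]'::jsonb;

-- Return to the configured default for this session.
RESET work_mem;
```

If the lossy count drops to zero and the time barely changes, the cost is more likely in reading the heap pages themselves than in the recheck.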
First, since the main issue is when all entries need to be investigated (e.g., when doing a count or retrieving columns/data from a query on all possible rows/entries), the following investigations are done with the assumption that all data/rows/entries need to be checked. In other words, the query speed is not a major issue when a reasonable number is used for LIMIT (or OFFSET, technically, if set to a sufficiently high number). This is also demonstrated above for the initial queries using LIMIT 10000.
2. New DB table
A new DB table is created with a single column containing the nested extras.optimade.elements, as well as Node PK and indexed using the following commands:
Resulting table overview:
Total time: 0.4 s
HUGE improvement from the previous 24 s! However, what if we now need other data (like UUID and another extra not stored in the special table, etc.) from the related Nodes? I.e., let's make a JOIN with the db_dbnode table.
Total time: 12 s
If we use LIMIT and OFFSET:
Total time: 1.9 s
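The commands and output for this section are not reproduced in this thread. As a rough sketch of the approach, assuming a hypothetical table name optimade_elements, hypothetical column names node_id and elements, and an arbitrary page size of 20:

```sql
-- Hypothetical side table: one row per Node, holding only the nested value.
CREATE TABLE optimade_elements AS
SELECT id AS node_id,
       extras -> 'optimade' -> 'elements' AS elements
FROM db_dbnode
WHERE (extras -> 'optimade') ? 'elements';

ALTER TABLE optimade_elements ADD PRIMARY KEY (node_id);

-- GIN index on the small table, restricted to the @> operator.
CREATE INDEX idx_gin_side_elements
    ON optimade_elements
    USING GIN (elements jsonb_path_ops);

-- Count over all matching rows, touching only the small table.
SELECT count(*)
FROM optimade_elements
WHERE elements @> '["C"]'::jsonb;

-- Fetch further Node data through a JOIN, one page at a time.
SELECT n.id, n.uuid, n.extras
FROM optimade_elements AS e
JOIN db_dbnode AS n ON n.id = e.node_id
WHERE e.elements @> '["C"]'::jsonb
ORDER BY n.id
LIMIT 20 OFFSET 0;
```

A table created this way is a snapshot: it would have to be rebuilt or kept in sync whenever the extras on db_dbnode change.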
In summary, creating a different table hosting the extras one knows will bear the brunt of queries will improve query times. However, for a 4+ million Node DB the time is still not optimal (12 s). If, however, one only needs to retrieve a subset of the results, e.g., for a REST API where only a handful of entries are returned per page, this will work fine (1.9 s). Indeed, one doesn't even need to create a separate table. Instead, a well-written GIN index that indexes the extras that bear the brunt of queries will suffice (<1 s), and is actually to be preferred in this instance.
But it's also worth noting that the gain with the GIN index is only around 5x (from 3.7 s to 0.7 s). Since the index stores copies of all the values that it indexes, it is not advisable to create an index for large valued extras. And combined with the small time gain, it might not be worth it to create a GIN index, since the count (or LIMIT ALL) is still taking too long with the GIN index (24.4 s or 12 s if using count and a separate table).
Further testing is needed, and the final solution for a user may be very dependent on the particular use case.