Move process_state from JSONB attribute to a dedicated column

Experience with verdi process list shows that queries using the new process_state variable are expensive for large databases. This is because there is no index on the JSONB attributes, and the query must do a sequential scan over these.

Proposed solutions:

Move process_state to a dedicated column with an index.
Add a GIN index to the JSONB (benchmarking needed).

Benchmarking from @sphuber below:

Useful and detailed information on this subject by @giovannipizzi in another thread:

https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830866937 https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830867591

Pasted below

There's quite some discussion and debugging on GIN indexes on #4664 done by @CasperWA and me. The main lessons we learnt are:

making a GIN index on everything is not useful and actually bad (you create a huge index, that will fill your disk and in the end not really help speed up your queries)
it's possible to make a GIN index only a specific field. This is in general not something we can do in AiiDA core as we don't know which extras or attributes a user will need to search (but an experienced user could add them). On the other hand, for the process_state, this is something known to AiiDA - so a quick solution, even before trying to move this field to a proper column, would be to add an index just for this.

From #4664, the syntax to create a GIN index only on a specific field is this:

CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade' -> 'elements') jsonb_path_ops);

and, what's even more interesting, you can create a partial index with a WHERE clause (e.g. to only add the index to processes, for instance, to reduce even more the size - note that the index will be used, however, only if the corresponding queries actually have the very same filter - so, for testing, maybe it's better to start with an index without WHERE clause).

The actual command to run should be:

CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops);

(one should check which operators are actually used, and if we can replace jsonb_ops with the faster but more limited json_path_ops, see bottom of this docs page.)

(Note: when timing, note that there a few last operations that are also slow: checking if there are too many slots used and issuing a warning, and checking when the status was last changed - in timing, you should remove that part; I don't remember if the --raw option does that, or you need to change the source code for timing).

However, we'll need also to revise the query generated by the QueryBuilder. In a simplified form a query could look like this:

select count(id) FROM db_dbnode where 
    CASE WHEN (jsonb_typeof(attributes #> '{process_state}') = 'string') THEN (attributes #>> '{process_state}') = 'finished' ELSE false END;

but (at least on my mid-size DB, with 41k calculations) this does not use the new index (checked with EXPLAIN ANALYZE in front of the query, it does a Seq Scan). I think it's because of the CASE.

@ltalirz you could try to:

run the command to create the index (it should be fast, a matter of a few seconds)
Try if the verdi process list becomes faster
If not, check if you can write a query in SQL that becomes fast and report it here If we don't manage, the other option is what @sphuber mentioned (moving that specific field to a DB column, with an index - that will become very fast). It will require a bit of code refactoring, though - but might be worth anyway, if it turns out that to use the GIN index we need to refactor some other part of the code anyway (the QueryBuilder).

As a further comment, this query:

explain analyze select count(id) FROM db_dbnode where  attributes -> 'process_state' @> '"finished"';

uses the new index, while already this:

explain analyze select count(id) FROM db_dbnode where  attributes @> '{"process_state": "finished"}';

does not - PSQL seems quite picky on using the index only if you are really writing explicitly queries on attributes -> 'process_state', and does not use it for queries on attributes directly, even if in practice the query is the same. @ltalirz you could at least check if the "manual" query (this is a simplified one, not really the one done by verdi process list becomes acceptably fast or not, after adding the GIN index.

Also, this would use the index as well:

explain analyze select count(id) FROM db_dbnode where 
    jsonb_typeof(attributes #> '{process_state}') = 'string' AND attributes -> 'process_state' @> '"finished"';

As a further comment. The "default" query generated by the query builder in verdi process list (with default parameters) is:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND CASE WHEN (jsonb_typeof(db_dbnode_1.attributes #> '{process_state}') = 'string') THEN (db_dbnode_1.attributes #>> '{process_state}') IN ('created', 'waiting', 'running') ELSE false END ORDER BY db_dbnode_1.ctime

This does a sequential scan.

This query (that should be equivalent if I'm not wrong), instead uses the index:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9 
FROM db_dbnode AS db_dbnode_1 

WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND 
(
db_dbnode_1.attributes -> 'process_state'  @> '"created"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"running"')
ORDER BY db_dbnode_1.ctime;

It would be good if e.g. @ConradJohnston and @ltalirz could add the index with CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops); and then report:

the time of the first query (and check with EXPLAIN ANALYZE in front, that confirm it is not using the index (it should contain Seq Scan on db_dbnode db_dbnode_1).
the time of the second query (and check with EXPLAIN ANALYZE in front, that confirm it is using the index (it should contain things like Bitmap Index Scan on idx_process_state (multiple times)).
the size of the index with \di+ idx_process_state
the count of processes (SELECT COUNT(*) from db_dbnode WHERE CAST(node_type AS VARCHAR) LIKE 'process.%%';).

(As a note, if you add EXPLAIN ANALYZE, it will report the time at the end (please report both planning and execution time).

If you are feeling very wishing to contribute :-) you could also, later:

drop the index (DROP INDEX idx_process_state; - if it seems to hang, make sure you have no verdi shells or daemons running)
check that the second query does not use anymore any index;
create a new faster index CREATE INDEX idx_process_state_jsonpath ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_path_ops);
check the time of the queries now
check the size of the new index with \di+ idx_process_state_jsonpath

Maybe also @sphuber or @mbercx could do the same on their big DB?

A few more comments:

the CASE statements are generated by this line and following ones: https://github.com/aiidateam/aiida-core/blob/6c4ced3331b389cebd01a59d55eb4b07f9452672/aiida/orm/implementation/django/querybuilder.py#L283
the same appear also in the SQLAlchemy backend (maybe we could merge them?)
in principle it should be easy to change these; HOWEVER, one important thing to check carefully, if we want to switch away from CASE, is the following: we should check that the query does NOT crash if there exist at least a process_state that is not a string (IIRC, I think the CASE was added on purpose to guarantee that the query would only find strings and not other types, on which it would then crash; in SQL the query can be optimised and so I don't think the AND statements are processed in order like in python).

Final report - I created a new Data node in AiiDA (PK= 325325), then added a non-string process_state with

UPDATE db_dbnode SET attributes = jsonb_set(attributes, '{process_state}', '123') WHERE id=325325;

and then run the following query (even without the type query - note that I filtered only the most recent results to have few results, and removed the filter on node_type to get also the data node):

SELECT db_dbnode_1.id, db_dbnode_1.attributes #> '{process_state}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.node_type, db_dbnode_1.ctime
FROM db_dbnode AS db_dbnode_1 
WHERE 
jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND 
(
db_dbnode_1.attributes -> 'process_state'  @> '"created"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"finished"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"running"')
AND db_dbnode_1.ctime > '2021-05-01 04:00'
ORDER BY db_dbnode_1.ctime;

It all worked fine (no crashes). So things seem fine (and we might even drop the query on type in these cases, maybe) but I feel we might need additional checks as I remember there were cases in which casting was essential to avoid query crashes (pinging @lekah in case he remembers the exact case).

The case statement was there to avoid errors, say, for example, the attribute 'foo' is set to 'bar', and a user queries 'attributes.foo':{'>':3}. The type check ensures this won't crash. There should be a couple of tests to check for that behavior.

Thanks @lekah

For completeness, the last query I reported is written in SQLAlchemy as follows (this can be useful if we decide to write differently the query generated by the query builder, assuming we can avoid the crash mentioned above by @lekah):

from sqlalchemy import and_, or_, func
from sqlalchemy.orm import aliased
from sqlalchemy.dialects import postgresql as mydialect

from aiida.manage.manager import get_manager
import aiida.backends.djsite.db.models as djmodels

backend = get_manager().get_backend()
session = backend.get_session()

table_entity = djmodels.DbNode.sa
aliased_table = aliased(table_entity)
attr_entity = getattr(aliased_table, 'attributes')

q = session.query(aliased_table.id, attr_entity[('process_state',)], attr_entity[('exit_status',)], aliased_table.node_type, aliased_table.ctime)
q = q.filter(and_(
    *[
        #aliased_table.node_type.like('process.%'),
        func.jsonb_typeof(attr_entity['process_state']) == 'string',
        or_(
            *[
                attr_entity['process_state'].contains('"created"'),
                attr_entity['process_state'].contains('"waiting"'),
                attr_entity['process_state'].contains('"finished"'),
                attr_entity['process_state'].contains('"running"'),
            ]
        ),
        aliased_table.ctime > '2021-05-01 04:00'
    ]
))
q = q.order_by(aliased_table.ctime)
#q = q.limit(10)
print(str(q.statement.compile(compile_kwargs={'literal_binds': True}, dialect=mydialect.dialect())))

#print(q.all())

(note: this is for a django backend - very minimal changes required for a SQLA backend)

I confirm @lekah's comment. Here is a proof of concept if at least one value is not a valid integer:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS FLOAT) > 100;

returns

ERROR:  cannot cast type jsonb to double precision
LINE 3: ..._1.attributes -> 'process_state')) = 'number' AND CAST(db_db...

since I have at least a string that cannot be cast (note that it does not matter that there is an AND filter). CAST also does not seem to provide a way to provide a default, so we're left with the ELSE statements we have already in AiiDA>

However, in our case we deal with string searches, so the idea could be that, when looking for strings with the QueryBuilder, we avoid casting altogether, and using the new syntax. We can keep the syntax instead for the rest.

Note also that it's possible to search using the index also for integer values, e.g. with this:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND db_dbnode_1.attributes -> 'process_state' @> '123';

However using casting prevents the GIN index from working, as this does not use the index:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS VARCHAR) = '123';

So this is a reason more to avoid casting in JSONB fields when looking for strings.

Finally: casting with the current approach will be instead needed for number values, but I think there is no efficient way to do numeric searches in JSONB fields (because of casting, WHEN clauses, and the unavailability JSONB indexes that work for non-strings)... The only thing that is efficient is searching for exact integer values (searched as strings) as shown above, but no > etc that uses an index.

This is something to keep in mind in general for AiiDA (of course this might be alleviated if one can pre-filter e.g. by the existence of a given field in a node first, and then do a sequential scan only on those nodes that have that field/key in the JSON - but we won't be able to do efficient numeric searches in huge DBs I think, and this will always need a custom DB).

aiidateam / aiida-core

Move process_state from JSONB attribute to a dedicated column #2770