aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
436 stars 191 forks source link

Move process_state from JSONB attribute to a dedicated column #2770

Open ConradJohnston opened 5 years ago

ConradJohnston commented 5 years ago

Experience with verdi process list shows that queries using the new process_state variable are expensive for large databases. This is because there is no index on the JSONB attributes, and the query must do a sequential scan over these.

Proposed solutions:

  1. Move process_state to a dedicated column with an index.
  2. Add a GIN index to the JSONB (benchmarking needed).

Benchmarking from @sphuber below:

sphuber commented 3 years ago

Useful and detailed information on this subject by @giovannipizzi in another thread:

https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830866937 https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830867591

Pasted below


There's quite some discussion and debugging on GIN indexes on #4664 done by @CasperWA and me. The main lessons we learnt are:

From #4664, the syntax to create a GIN index only on a specific field is this:

CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade' -> 'elements') jsonb_path_ops);

and, what's even more interesting, you can create a partial index with a WHERE clause (e.g. to only add the index to processes, for instance, to reduce even more the size - note that the index will be used, however, only if the corresponding queries actually have the very same filter - so, for testing, maybe it's better to start with an index without WHERE clause).

The actual command to run should be:

CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops);

(one should check which operators are actually used, and if we can replace jsonb_ops with the faster but more limited json_path_ops, see bottom of this docs page.)

(Note: when timing, note that there a few last operations that are also slow: checking if there are too many slots used and issuing a warning, and checking when the status was last changed - in timing, you should remove that part; I don't remember if the --raw option does that, or you need to change the source code for timing).

However, we'll need also to revise the query generated by the QueryBuilder. In a simplified form a query could look like this:

select count(id) FROM db_dbnode where 
    CASE WHEN (jsonb_typeof(attributes #> '{process_state}') = 'string') THEN (attributes #>> '{process_state}') = 'finished' ELSE false END;

but (at least on my mid-size DB, with 41k calculations) this does not use the new index (checked with EXPLAIN ANALYZE in front of the query, it does a Seq Scan). I think it's because of the CASE.

@ltalirz you could try to:

  1. run the command to create the index (it should be fast, a matter of a few seconds)
  2. Try if the verdi process list becomes faster
  3. If not, check if you can write a query in SQL that becomes fast and report it here If we don't manage, the other option is what @sphuber mentioned (moving that specific field to a DB column, with an index - that will become very fast). It will require a bit of code refactoring, though - but might be worth anyway, if it turns out that to use the GIN index we need to refactor some other part of the code anyway (the QueryBuilder).

As a further comment, this query:

explain analyze select count(id) FROM db_dbnode where  attributes -> 'process_state' @> '"finished"';

uses the new index, while already this:

explain analyze select count(id) FROM db_dbnode where  attributes @> '{"process_state": "finished"}';

does not - PSQL seems quite picky on using the index only if you are really writing explicitly queries on attributes -> 'process_state', and does not use it for queries on attributes directly, even if in practice the query is the same. @ltalirz you could at least check if the "manual" query (this is a simplified one, not really the one done by verdi process list becomes acceptably fast or not, after adding the GIN index.

Also, this would use the index as well:

explain analyze select count(id) FROM db_dbnode where 
    jsonb_typeof(attributes #> '{process_state}') = 'string' AND attributes -> 'process_state' @> '"finished"';
giovannipizzi commented 3 years ago

As a further comment. The "default" query generated by the query builder in verdi process list (with default parameters) is:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND CASE WHEN (jsonb_typeof(db_dbnode_1.attributes #> '{process_state}') = 'string') THEN (db_dbnode_1.attributes #>> '{process_state}') IN ('created', 'waiting', 'running') ELSE false END ORDER BY db_dbnode_1.ctime

This does a sequential scan.

This query (that should be equivalent if I'm not wrong), instead uses the index:

SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9 
FROM db_dbnode AS db_dbnode_1 

WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND 
(
db_dbnode_1.attributes -> 'process_state'  @> '"created"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"running"')
ORDER BY db_dbnode_1.ctime;

It would be good if e.g. @ConradJohnston and @ltalirz could add the index with CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops); and then report:

(As a note, if you add EXPLAIN ANALYZE, it will report the time at the end (please report both planning and execution time).

If you are feeling very wishing to contribute :-) you could also, later:

Maybe also @sphuber or @mbercx could do the same on their big DB?

giovannipizzi commented 3 years ago

A few more comments:

giovannipizzi commented 3 years ago

Final report - I created a new Data node in AiiDA (PK= 325325), then added a non-string process_state with

UPDATE db_dbnode SET attributes = jsonb_set(attributes, '{process_state}', '123') WHERE id=325325;

and then run the following query (even without the type query - note that I filtered only the most recent results to have few results, and removed the filter on node_type to get also the data node):

SELECT db_dbnode_1.id, db_dbnode_1.attributes #> '{process_state}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.node_type, db_dbnode_1.ctime
FROM db_dbnode AS db_dbnode_1 
WHERE 
jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND 
(
db_dbnode_1.attributes -> 'process_state'  @> '"created"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"finished"' OR
db_dbnode_1.attributes -> 'process_state'  @> '"running"')
AND db_dbnode_1.ctime > '2021-05-01 04:00'
ORDER BY db_dbnode_1.ctime;

It all worked fine (no crashes). So things seem fine (and we might even drop the query on type in these cases, maybe) but I feel we might need additional checks as I remember there were cases in which casting was essential to avoid query crashes (pinging @lekah in case he remembers the exact case).

lekah commented 3 years ago

The case statement was there to avoid errors, say, for example, the attribute 'foo' is set to 'bar', and a user queries 'attributes.foo':{'>':3}. The type check ensures this won't crash. There should be a couple of tests to check for that behavior.

giovannipizzi commented 3 years ago

Thanks @lekah

For completeness, the last query I reported is written in SQLAlchemy as follows (this can be useful if we decide to write differently the query generated by the query builder, assuming we can avoid the crash mentioned above by @lekah):

from sqlalchemy import and_, or_, func
from sqlalchemy.orm import aliased
from sqlalchemy.dialects import postgresql as mydialect

from aiida.manage.manager import get_manager
import aiida.backends.djsite.db.models as djmodels

backend = get_manager().get_backend()
session = backend.get_session()

table_entity = djmodels.DbNode.sa
aliased_table = aliased(table_entity)
attr_entity = getattr(aliased_table, 'attributes')

q = session.query(aliased_table.id, attr_entity[('process_state',)], attr_entity[('exit_status',)], aliased_table.node_type, aliased_table.ctime)
q = q.filter(and_(
    *[
        #aliased_table.node_type.like('process.%'),
        func.jsonb_typeof(attr_entity['process_state']) == 'string',
        or_(
            *[
                attr_entity['process_state'].contains('"created"'),
                attr_entity['process_state'].contains('"waiting"'),
                attr_entity['process_state'].contains('"finished"'),
                attr_entity['process_state'].contains('"running"'),
            ]
        ),
        aliased_table.ctime > '2021-05-01 04:00'
    ]
))
q = q.order_by(aliased_table.ctime)
#q = q.limit(10)
print(str(q.statement.compile(compile_kwargs={'literal_binds': True}, dialect=mydialect.dialect())))

#print(q.all())

(note: this is for a django backend - very minimal changes required for a SQLA backend)

giovannipizzi commented 3 years ago

I confirm @lekah's comment. Here is a proof of concept if at least one value is not a valid integer:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS FLOAT) > 100;

returns

ERROR:  cannot cast type jsonb to double precision
LINE 3: ..._1.attributes -> 'process_state')) = 'number' AND CAST(db_db...

since I have at least a string that cannot be cast (note that it does not matter that there is an AND filter). CAST also does not seem to provide a way to provide a default, so we're left with the ELSE statements we have already in AiiDA>

However, in our case we deal with string searches, so the idea could be that, when looking for strings with the QueryBuilder, we avoid casting altogether, and using the new syntax. We can keep the syntax instead for the rest.

Note also that it's possible to search using the index also for integer values, e.g. with this:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND db_dbnode_1.attributes -> 'process_state' @> '123';

However using casting prevents the GIN index from working, as this does not use the index:

SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1 
WHERE jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS VARCHAR) = '123';

So this is a reason more to avoid casting in JSONB fields when looking for strings.

Finally: casting with the current approach will be instead needed for number values, but I think there is no efficient way to do numeric searches in JSONB fields (because of casting, WHEN clauses, and the unavailability JSONB indexes that work for non-strings)... The only thing that is efficient is searching for exact integer values (searched as strings) as shown above, but no > etc that uses an index.

This is something to keep in mind in general for AiiDA (of course this might be alleviated if one can pre-filter e.g. by the existence of a given field in a node first, and then do a sequential scan only on those nodes that have that field/key in the JSON - but we won't be able to do efficient numeric searches in huge DBs I think, and this will always need a custom DB).