Open ConradJohnston opened 5 years ago
Useful and detailed information on this subject by @giovannipizzi in another thread:
https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830866937 https://github.com/aiidateam/aiida-core/issues/4841#issuecomment-830867591
Pasted below
There's quite some discussion and debugging on GIN indexes on #4664 done by @CasperWA and me. The main lessons we learnt are:
process_state
, this is something known to AiiDA - so a quick solution, even before trying to move this field to a proper column, would be to add an index just for this.From #4664, the syntax to create a GIN index only on a specific field is this:
CREATE INDEX idx_gin_optimade_elements ON db_dbnode USING GIN ((extras -> 'optimade' -> 'elements') jsonb_path_ops);
and, what's even more interesting, you can create a partial index with a WHERE clause (e.g. to only add the index to processes, for instance, to reduce even more the size - note that the index will be used, however, only if the corresponding queries actually have the very same filter - so, for testing, maybe it's better to start with an index without WHERE clause).
The actual command to run should be:
CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops);
(one should check which operators are actually used, and if we can replace jsonb_ops
with the faster but more limited json_path_ops
, see bottom of this docs page.)
(Note: when timing, note that there a few last operations that are also slow: checking if there are too many slots used and issuing a warning, and checking when the status was last changed - in timing, you should remove that part; I don't remember if the --raw option does that, or you need to change the source code for timing).
However, we'll need also to revise the query generated by the QueryBuilder. In a simplified form a query could look like this:
select count(id) FROM db_dbnode where
CASE WHEN (jsonb_typeof(attributes #> '{process_state}') = 'string') THEN (attributes #>> '{process_state}') = 'finished' ELSE false END;
but (at least on my mid-size DB, with 41k calculations) this does not use the new index (checked with EXPLAIN ANALYZE in front of the query, it does a Seq Scan
).
I think it's because of the CASE.
@ltalirz you could try to:
verdi process list becomes faster
As a further comment, this query:
explain analyze select count(id) FROM db_dbnode where attributes -> 'process_state' @> '"finished"';
uses the new index, while already this:
explain analyze select count(id) FROM db_dbnode where attributes @> '{"process_state": "finished"}';
does not - PSQL seems quite picky on using the index only if you are really writing explicitly queries on attributes -> 'process_state'
, and does not use it for queries on attributes
directly, even if in practice the query is the same.
@ltalirz you could at least check if the "manual" query (this is a simplified one, not really the one done by verdi process list
becomes acceptably fast or not, after adding the GIN index.
Also, this would use the index as well:
explain analyze select count(id) FROM db_dbnode where
jsonb_typeof(attributes #> '{process_state}') = 'string' AND attributes -> 'process_state' @> '"finished"';
As a further comment. The "default" query generated by the query builder in verdi process list
(with default parameters) is:
SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9
FROM db_dbnode AS db_dbnode_1
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND CASE WHEN (jsonb_typeof(db_dbnode_1.attributes #> '{process_state}') = 'string') THEN (db_dbnode_1.attributes #>> '{process_state}') IN ('created', 'waiting', 'running') ELSE false END ORDER BY db_dbnode_1.ctime
This does a sequential scan.
This query (that should be equivalent if I'm not wrong), instead uses the index:
SELECT db_dbnode_1.id, db_dbnode_1.uuid, db_dbnode_1.ctime, db_dbnode_1.mtime, db_dbnode_1.attributes #> '{process_state}' AS anon_1, db_dbnode_1.attributes #> '{process_status}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.attributes #> '{sealed}' AS anon_4, db_dbnode_1.attributes #> '{process_label}' AS anon_5, db_dbnode_1.label, db_dbnode_1.description, db_dbnode_1.node_type, db_dbnode_1.attributes #> '{paused}' AS anon_6, db_dbnode_1.process_type, db_dbnode_1.attributes #> '{state}' AS anon_7, db_dbnode_1.attributes #> '{scheduler_state}' AS anon_8, db_dbnode_1.attributes #> '{exception}' AS anon_9
FROM db_dbnode AS db_dbnode_1
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'process.%%' AND jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND
(
db_dbnode_1.attributes -> 'process_state' @> '"created"' OR
db_dbnode_1.attributes -> 'process_state' @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state' @> '"running"')
ORDER BY db_dbnode_1.ctime;
It would be good if e.g. @ConradJohnston and @ltalirz could add the index with CREATE INDEX idx_process_state ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_ops);
and then report:
EXPLAIN ANALYZE
in front, that confirm it is not using the index (it should contain Seq Scan on db_dbnode db_dbnode_1
).EXPLAIN ANALYZE
in front, that confirm it is using the index (it should contain things like Bitmap Index Scan on idx_process_state
(multiple times)).\di+ idx_process_state
SELECT COUNT(*) from db_dbnode WHERE CAST(node_type AS VARCHAR) LIKE 'process.%%';
).(As a note, if you add EXPLAIN ANALYZE
, it will report the time at the end (please report both planning and execution time).
If you are feeling very wishing to contribute :-) you could also, later:
DROP INDEX idx_process_state;
- if it seems to hang, make sure you have no verdi shells or daemons running)CREATE INDEX idx_process_state_jsonpath ON db_dbnode USING GIN ((attributes -> 'process_state') jsonb_path_ops);
\di+ idx_process_state_jsonpath
Maybe also @sphuber or @mbercx could do the same on their big DB?
A few more comments:
CASE
statements are generated by this line and following ones:
https://github.com/aiidateam/aiida-core/blob/6c4ced3331b389cebd01a59d55eb4b07f9452672/aiida/orm/implementation/django/querybuilder.py#L283process_state
that is not a string (IIRC, I think the CASE was added on purpose to guarantee that the query would only find strings and not other types, on which it would then crash; in SQL the query can be optimised and so I don't think the AND statements are processed in order like in python). Final report - I created a new Data node in AiiDA (PK= 325325), then added a non-string process_state
with
UPDATE db_dbnode SET attributes = jsonb_set(attributes, '{process_state}', '123') WHERE id=325325;
and then run the following query (even without the type query - note that I filtered only the most recent results to have few results, and removed the filter on node_type to get also the data node):
SELECT db_dbnode_1.id, db_dbnode_1.attributes #> '{process_state}' AS anon_2, db_dbnode_1.attributes #> '{exit_status}' AS anon_3, db_dbnode_1.node_type, db_dbnode_1.ctime
FROM db_dbnode AS db_dbnode_1
WHERE
jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'string' AND
(
db_dbnode_1.attributes -> 'process_state' @> '"created"' OR
db_dbnode_1.attributes -> 'process_state' @> '"waiting"' OR
db_dbnode_1.attributes -> 'process_state' @> '"finished"' OR
db_dbnode_1.attributes -> 'process_state' @> '"running"')
AND db_dbnode_1.ctime > '2021-05-01 04:00'
ORDER BY db_dbnode_1.ctime;
It all worked fine (no crashes). So things seem fine (and we might even drop the query on type in these cases, maybe) but I feel we might need additional checks as I remember there were cases in which casting was essential to avoid query crashes (pinging @lekah in case he remembers the exact case).
The case statement was there to avoid errors, say, for example, the attribute 'foo' is set to 'bar', and a user queries 'attributes.foo':{'>':3}. The type check ensures this won't crash. There should be a couple of tests to check for that behavior.
Thanks @lekah
For completeness, the last query I reported is written in SQLAlchemy as follows (this can be useful if we decide to write differently the query generated by the query builder, assuming we can avoid the crash mentioned above by @lekah):
from sqlalchemy import and_, or_, func
from sqlalchemy.orm import aliased
from sqlalchemy.dialects import postgresql as mydialect
from aiida.manage.manager import get_manager
import aiida.backends.djsite.db.models as djmodels
backend = get_manager().get_backend()
session = backend.get_session()
table_entity = djmodels.DbNode.sa
aliased_table = aliased(table_entity)
attr_entity = getattr(aliased_table, 'attributes')
q = session.query(aliased_table.id, attr_entity[('process_state',)], attr_entity[('exit_status',)], aliased_table.node_type, aliased_table.ctime)
q = q.filter(and_(
*[
#aliased_table.node_type.like('process.%'),
func.jsonb_typeof(attr_entity['process_state']) == 'string',
or_(
*[
attr_entity['process_state'].contains('"created"'),
attr_entity['process_state'].contains('"waiting"'),
attr_entity['process_state'].contains('"finished"'),
attr_entity['process_state'].contains('"running"'),
]
),
aliased_table.ctime > '2021-05-01 04:00'
]
))
q = q.order_by(aliased_table.ctime)
#q = q.limit(10)
print(str(q.statement.compile(compile_kwargs={'literal_binds': True}, dialect=mydialect.dialect())))
#print(q.all())
(note: this is for a django backend - very minimal changes required for a SQLA backend)
I confirm @lekah's comment. Here is a proof of concept if at least one value is not a valid integer:
SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS FLOAT) > 100;
returns
ERROR: cannot cast type jsonb to double precision
LINE 3: ..._1.attributes -> 'process_state')) = 'number' AND CAST(db_db...
since I have at least a string that cannot be cast (note that it does not matter that there is an AND filter). CAST also does not seem to provide a way to provide a default, so we're left with the ELSE statements we have already in AiiDA>
However, in our case we deal with string searches, so the idea could be that, when looking for strings with the QueryBuilder, we avoid casting altogether, and using the new syntax. We can keep the syntax instead for the rest.
Note also that it's possible to search using the index also for integer values, e.g. with this:
SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1
WHERE jsonb_typeof((db_dbnode_1.attributes -> 'process_state')) = 'number' AND db_dbnode_1.attributes -> 'process_state' @> '123';
However using casting prevents the GIN index from working, as this does not use the index:
SELECT db_dbnode_1.attributes #> '{process_state}' AS anon_1
FROM db_dbnode AS db_dbnode_1
WHERE jsonb_typeof(db_dbnode_1.attributes -> 'process_state') = 'number' AND CAST(db_dbnode_1.attributes -> 'process_state' AS VARCHAR) = '123';
So this is a reason more to avoid casting in JSONB fields when looking for strings.
Finally: casting with the current approach will be instead needed for number values, but I think there is no efficient way to do numeric searches in JSONB fields (because of casting, WHEN clauses, and the unavailability JSONB indexes that work for non-strings)... The only thing that is efficient is searching for exact integer values (searched as strings) as shown above, but no >
etc that uses an index.
This is something to keep in mind in general for AiiDA (of course this might be alleviated if one can pre-filter e.g. by the existence of a given field in a node first, and then do a sequential scan only on those nodes that have that field/key in the JSON - but we won't be able to do efficient numeric searches in huge DBs I think, and this will always need a custom DB).
Experience with
verdi process list
shows that queries using the newprocess_state
variable are expensive for large databases. This is because there is no index on the JSONB attributes, and the query must do a sequential scan over these.Proposed solutions:
process_state
to a dedicated column with an index.Benchmarking from @sphuber below: