ltalirz opened this issue 3 years ago
A couple of things:
- Only one checkpoint is ever written per node. At each step it is overwritten, so previous ones are not kept.
- The checkpoint should be removed when the process terminates, so you should only ever have checkpoints in the database for active nodes.
These things combined should lead to acceptable data usage in the database, although I can see how in certain use cases where there are a lot of concurrently active processes with large checkpoints it can temporarily add a lot of data to the database. We could investigate what would happen to loading efficiency if we were to store the checkpoint in the repository instead.
Hi @sphuber , thanks for the quick reply.
- Only one checkpoint is ever written per node. At each step it is overwritten, so previous ones are not kept.
Right, I guess what I'm speculating here is that while this may be true for the state of the database as viewed from outside postgres, it may, effectively, not be true for the data that postgres stores on disk since postgres for some reason is not able to automatically release the overwritten jsonb fields / rows / ...
The checkpoint should be removed when the process terminates, so you should only ever have checkpoints in the database for active nodes.
Ok, there were still active nodes in the DB, so those may be the ones that show up high in the list of largest db_dbnode rows.
Right, if PostgreSQL indeed does not automatically vacuum the space on disk when the checkpoints are overwritten, then that would be problematic. Since this happens so often when running AiiDA, it really should be cleaned up automatically. I personally have no idea why this could be happening or even how we could confirm that the data being retained is due to the checkpoints that are stored in the attribute. Maybe @giovannipizzi has an idea how we could confirm or disprove this.
That being said, if we find that storing the checkpoint in the database is problematic, changing this to use the repository instead would be really easy. We would just have to change the set_checkpoint method (and, of course, get_checkpoint and delete_checkpoint). The usual caveat is that this would require all processes to be finished before the code is upgraded, or a migration would have to be added. However, even here I could foresee similar problems, especially now that we have the new object store, where object deletion is not immediate: only the reference gets deleted and the actual space has to be freed with a manual operation. So with that solution we would actually end up in the situation that you are hypothesizing we are in now with storing the checkpoints in Postgres. So the solution may not be so trivial after all.
Yes, I was thinking the same.
Removing data from the database that isn't queried should be a good thing, but since the new file repository does not have a built-in autocleanup (as e.g. the file system would), it would not solve the "bloating" issue in production runs.
I did use this query to look at the DB before doing the VACUUM, but I did not notice any indication of an issue (actually, I originally thought this meant that also a VACUUM FULL would not help, and was surprised when it did):
current_database | schemaname | tablename | tbloat | wastedbytes | iname | ibloat | wastedibytes
-------------------------------------------------------------+------------+--------------------+--------+-------------+-------------------------------------------------------+--------+--------------
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_pkey | 0.0 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_loggername_00b5ba16 | 0.1 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_loggername_00b5ba16_like | 0.1 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_levelname_ad5dc346 | 0.0 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_levelname_ad5dc346_like | 0.0 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_uuid_9cf77df3_uniq | 0.0 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblog | 1.2 | 130646016 | db_dblog_dbnode_id_da34b732 | 0.0 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_user_id_12e7aeaf | 1.3 | 266534912
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_pkey | 0.1 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_ctime_71626ef5 | 0.8 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_type_a8ce9753 | 1.5 | 367992832
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_type_a8ce9753_like | 1.5 | 383983616
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_uuid_62e0bf98_uniq | 0.1 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_process_type_df7298d0_like | 1.2 | 128049152
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_process_type_df7298d0 | 1.2 | 126664704
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_mtime_0554ea3d | 1.1 | 48373760
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_label_6469539e | 0.9 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_label_6469539e_like | 0.9 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dbnode | 1.1 | 107773952 | db_dbnode_dbcomputer_id_315372a3 | 1.1 | 44826624
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblink | 1.2 | 48791552 | db_dblink_pkey | 0.5 | 0
reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public | db_dblink | 1.2 | 48791552 | db_dblink_label_f1343cfb | 1.0 | 0
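For reference, a rougher but quicker check for accumulating dead tuples (not the bloat query used above, just a sketch based on the statistics views) is:

```sql
-- Rough per-table dead-tuple count and last (auto)vacuum times,
-- based on the statistics collector (approximate numbers)
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```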
P.S. For the VACUUM FULL I needed to kill all connections to the DB (stop the daemon workers, etc.), which was not needed for the VACUUM. I wonder whether it is possible that this backlog of data was somehow held back from being freed by one of these open connections...
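For anyone hitting the same lock issue: one way to terminate the remaining connections before running the VACUUM FULL (a sketch; stop the AiiDA daemon first) is:

```sql
-- Terminate all other connections to the current database,
-- so that VACUUM FULL can acquire its exclusive lock
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();
```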
Do you know whether this can be reproduced with both backends (django and sqla)? Just wondering whether this might be an issue of transaction isolation where PostgreSQL keeps the data around for other clients/connections to retrieve them in the previous state. And the ORM backends might actually differ in the way they handle this. If this is indeed the problem it might point to a deeper problem in our usage of the ORM and it might be a good idea to revise how we control transactions and which guarantees wrt consistency and isolation we actually need.
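One way to check whether a connection is indeed holding a transaction open (and thereby keeping old row versions around) would be something along these lines:

```sql
-- List sessions with an open transaction; a long-running "idle in transaction"
-- session prevents VACUUM from reclaiming dead tuples created after it started
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
```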
A few comments:
When a row is changed with an update, postgres will actually (for efficiency) create a new row and point to it, so updates do not free up the space used by older versions. From the PostgreSQL documentation:

In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.
Plain VACUUM (without FULL) simply reclaims space and makes it available for re-use. [...] However, extra space is not returned to the operating system (in most cases); it's just kept available for re-use within the same table. [...] VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system.
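In practice, the distinction looks like this (a sketch, using db_dbnode as in the numbers above):

```sql
-- Marks dead tuples as reusable; space stays allocated to the table,
-- but no exclusive lock is needed
VACUUM ANALYZE db_dbnode;

-- Rewrites the table and its indexes into new files and returns the space to the OS;
-- needs an exclusive lock and temporary disk space for a full copy
VACUUM FULL ANALYZE db_dbnode;
```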
I think the concept of vacuuming is there in most DBs because of performance reasons. We need to inform users that this should be done often for big DBs (maybe we should try to make this one of the tasks of the new repository management? I.e., when we do a similar operation for the repository, we also do a vacuum full analyze for the DB? pinging @ramirezfranciscof and @chrisjsewell who are working on that).
I don't think this is something we can avoid. (I think this should be the cause, but I'm not sure - it might also be that there was some transaction pending. If Leo manages to reproduce the issue and closes all connections, we could double check; but if I had to bet, I think it's because of the checkpoints).
As mentioned earlier, moving those to the object store would not help - you have the same issue of vacuuming also there (at least after you pack the loose objects), and in any case it does not seem to me the right place to store checkpoints; I see the repository only as a destination for permanent data.
And in general, even if this is not the root cause, if checkpoints can be so big, I would vote for moving them out from the DB.
By the way, this will also solve another important problem: if a user modifies e.g. extras of a node while it's still running, either the user or the daemon could in principle get errors because concurrent modifications are being done to the same node. I didn't check how changing extras and attributes is done in the 2 ORMs - and maybe it's done properly, so concurrent changes to both don't raise an exception when committing the transaction - but I see this as a potential problem (and I think we don't see it often because very rarely people change extras while jobs are running, but I think it's something that shouldn't happen anyway).
Since luckily the checkpoint management is quite confined to a few places (set_checkpoint etc.), and there is also a delete_checkpoint, a practical suggestion - that would also not be complex to implement - is the following: store each checkpoint as a file on disk (outside the database and the repository), at a path of the form
checkpoints/aa/bbccddee-ff...
(I.e. use a 1-level sharding). Experience showed that a 2-level sharding does not really help; a 1-level sharding (with 2 characters) seems to be the best compromise.
By the way, I checked one of my profiles where I recently ran a lot. In my case VACUUM FULL ANALYZE helped a lot in reducing the index size - it also reduced the toast size, but that wasn't the main issue in my case.
Before:
table_schema | table_name | row_estimate | total | index | toast | table
--------------------+-------------------------+--------------+------------+------------+------------+------------
public | db_dbnode | 999019 | 6754 MB | 5578 MB | 635 MB | 541 MB
After:
table_schema | table_name | row_estimate | total | index | toast | table
--------------------+-------------------------+--------------+------------+------------+------------+------------
public | db_dbnode | 997595 | 1388 MB | 615 MB | 273 MB | 500 MB
So, again, it would be beneficial to hint advanced users to vacuum the DB every now and then, and bundle this together with the repository vacuuming.
However, both VACUUM FULL and disk-object-store's repack need additional disk space to temporarily store a full copy. E.g. from PSQL's docs:
This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete. Usually this should only be used when a significant amount of space needs to be reclaimed from within the table.
And the same applies to the disk-object-store. So for @chrisjsewell and @ramirezfranciscof it would be good to keep at least 2 levels of "vacuuming/repacking": one that can be run relatively often and is faster, and one "full/deep" variant that requires some disk space, requires stopping everything, and might take time - but is good to do every few months.
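Since the deep variant temporarily needs roughly as much free disk as the relation it rewrites, it may be worth checking the sizes beforehand, e.g.:

```sql
-- Estimate how much temporary disk space a VACUUM FULL of db_dbnode would need
SELECT pg_size_pretty(pg_total_relation_size('db_dbnode'))   AS db_dbnode_total,
       pg_size_pretty(pg_database_size(current_database()))  AS database_total;
```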
Just to briefly follow up as well (started writing before Gio's comments)
Hi @dev-zero , yes that is one possibility (fyi, this particular case was on a django backend). I think there are at least two, though:
The VACUUM documentation isn't very explicit on what happens under the hood, but I read here that the regular VACUUM never gives back space to the operating system but just removes dead tuples from the index (and does some internal defragmentation on the remaining data).
It turns out that the default AUTOVACUUM settings include an autovacuum_vacuum_scale_factor of 0.2 (and a minimum threshold of 50, which is added on top), meaning that it will kick in only once 20% of the rows of a table (+50) have changed. In the above example, this corresponds to 300k rows, i.e. it seems possible that 300GB of checkpoints were written in between autovacuums (at ~6MB per checkpoint as shown above, that would be 50k checkpoint updates - well below the 300k threshold). If that is the problem, then a "quick fix" would be to substantially reduce the autovacuum_vacuum_scale_factor for the db_dbnode table, but in the medium term I agree that the database is likely not the best place to store the checkpoints.
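Such a per-table override is supported directly by PostgreSQL; a sketch (the 0.01 value is purely illustrative, not a tested recommendation):

```sql
-- Check the current global autovacuum trigger settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_scale_factor', 'autovacuum_vacuum_threshold');

-- Make autovacuum kick in much earlier for db_dbnode only
-- (0.01 is an illustrative value, not a tested recommendation)
ALTER TABLE db_dbnode SET (autovacuum_vacuum_scale_factor = 0.01);
```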
The nice thing about postgresql of course is that it does have a built-in autovacuum capability (even if the default settings may have failed us here), while this capability is still lacking in the new file repository implementation.
Checkpoints of processes, e.g. of workchains, are currently stored in the database, namely in the checkpoints attribute of the corresponding ProcessNode. I guess these checkpoints are never queried for, so effectively this is using the database as a key-value store for potentially large amounts of data (correct me if I'm wrong). I am reporting here a case where I suspect that this implementation has led to extreme bloating of the postgresql database (factor of 100).
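For reference, a way to check how many nodes currently carry a checkpoint and how much space their attributes take could look like this (a sketch; it assumes the JSONB attributes column used by the current backends and the checkpoints attribute key):

```sql
-- Nodes that currently have a stored checkpoint, and the space their attributes take
-- (pg_column_size measures the compressed/TOASTed on-disk size of the column value)
SELECT count(*)                                                  AS nodes_with_checkpoint,
       pg_size_pretty(sum(pg_column_size(attributes))::bigint)   AS attributes_size
FROM db_dbnode
WHERE attributes ? 'checkpoints';
```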
Use case
The database has 1.5M nodes; here are the cumulative sizes of the rows of different node types in the db_dbnode table (size includes TOAST):

That doesn't look too bad. If one looks for the largest rows, it turns out that these are a handful of WorkChainNodes (followed by StructureData). The first column here should be roughly the space taken by the row in the postgres DB; the second column is the size after conversion to text. I.e. the largest row is a WorkChainNode that contains 33MB worth of text (and basically all of that is the checkpoint, see checkpoint.zip).
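The kind of query used to obtain those two columns could look roughly like this (a sketch, not necessarily the exact query that was used):

```sql
-- Largest rows in db_dbnode: on-disk (compressed/TOASTed) size of the attributes column
-- vs. its size after conversion to text
SELECT id,
       uuid,
       pg_column_size(attributes)     AS stored_bytes,
       octet_length(attributes::text) AS text_bytes
FROM db_dbnode
ORDER BY pg_column_size(attributes) DESC
LIMIT 20;
```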
Now, the problem is this: the actual size of db_dbnode is not 2GB, it is 341 GB!

Some more info
- AUTOVACUUM was on
- VACUUM ANALYZE did not change anything
- VACUUM FULL ANALYZE reduced the total size of db_dbnode from 341GB to <3 GB

My theory for what is happening here is that AiiDA keeps writing large amounts (MBs) of data to the checkpoints attributes of processes at every step. For some reason, postgresql is not able to "release" the previous data, so it just keeps accumulating on disk.

Unfortunately, the output of VACUUM FULL VERBOSE ANALYZE db_dbnode did not provide any indication of what data was being freed, so I cannot say with certainty that this theory is correct. However, this DB did contain workchains that ran particularly large numbers of steps (at least 6xN where N=30..80) and large checkpoints, because they contained large StructureDataNodes (see example checkpoint above), so this seems plausible to me.

Overall, this leads me back to the question of whether storing checkpoints in the database (rather than the file repository / ...) is really the sensible thing to do here.
Mentioning @sphuber and @chrisjsewell for comment
P.S. If someone would like to check whether their database may be affected by this issue, I suggest you run the first of these queries on your database and check whether the toast size of db_dbnode is a lot larger than the table size (say, >10x).
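A query along those lines (a sketch, not necessarily the exact one referred to above) could be:

```sql
-- Per-table size breakdown: total, index, TOAST and heap ("table") sizes
SELECT n.nspname                                      AS table_schema,
       c.relname                                      AS table_name,
       c.reltuples::bigint                            AS row_estimate,
       pg_size_pretty(pg_total_relation_size(c.oid))  AS total,
       pg_size_pretty(pg_indexes_size(c.oid))         AS "index",
       pg_size_pretty(coalesce(pg_total_relation_size(c.reltoastrelid), 0)) AS toast,
       pg_size_pretty(pg_relation_size(c.oid))        AS "table"
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```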