aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Is the database really the right place for storing process checkpoints? #5025

Open ltalirz opened 3 years ago

ltalirz commented 3 years ago

Checkpoints of processes, e.g. of workchains, are currently stored in the database, namely in the checkpoints attribute of the corresponding ProcessNode. I guess these checkpoints are never queried for, so effectively this is using the database as a key-value store for potentially large amounts of data (correct me if I'm wrong).

I am reporting here a case where I suspect that this implementation has led to extreme bloating of the postgresql database (factor of 100).

Use case

The database has 1.5M nodes; here is the cumulative size of the rows for each node type in the db_dbnode table (size includes TOAST):

#Node type                               Size [GB]
data.structure.StructureData.            0.82
data.dict.Dict.                          0.43
process.calculation.calcjob.CalcJobNode. 0.30
process.workflow.workchain.WorkChainNode. 0.25
process.calculation.calcfunction.CalcFunctionNode. 0.08
data.remote.RemoteData.                  0.04
data.folder.FolderData.                  0.03
data.singlefile.SinglefileData.          0.01
data.int.Int.                            0.01
process.workflow.workfunction.WorkFunctionNode. 0.01
data.zeopp.parameters.NetworkParameters. 0.01
data.list.List.                          0.00
data.cif.CifData.                        0.00
data.bool.Bool.                          0.00
data.str.Str.                            0.00
data.float.Float.                        0.00
data.code.Code.                          0.00
...
Sum of all: 2.00 GB

That doesn't look too bad. If one looks for the largest rows, it turns out that these are a handful of WorkChainNodes (followed by StructureData).

SELECT pg_column_size(t.*), octet_length(t.*::text), id, node_type FROM db_dbnode AS t ORDER BY 1 DESC LIMIT 10;

 pg_column_size | octet_length |   id   |                 node_type
----------------+--------------+--------+-------------------------------------------
        6796034 |     33085238 | 613501 | process.workflow.workchain.WorkChainNode.
        6776518 |     33083900 | 613490 | process.workflow.workchain.WorkChainNode.
        6583871 |     32203789 | 645541 | process.workflow.workchain.WorkChainNode.
        6489467 |     33085192 | 613465 | process.workflow.workchain.WorkChainNode.
        6356272 |     31354828 | 645400 | process.workflow.workchain.WorkChainNode.
        6148939 |     29678621 | 644903 | process.workflow.workchain.WorkChainNode.
        5980014 |     31350388 | 645538 | process.workflow.workchain.WorkChainNode.
        5856665 |     28860705 | 644443 | process.workflow.workchain.WorkChainNode.
        5708072 |     28063333 | 644241 | process.workflow.workchain.WorkChainNode.
        5585683 |     27255363 | 643979 | process.workflow.workchain.WorkChainNode.
(10 rows)

The first column here should be roughly the space taken by the row in the postgres DB; the second column is the size after conversion to text. I.e. the largest row is a WorkChainNode that contains 33MB worth of text (and basically all of that is the checkpoint, see checkpoint.zip).
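
To double-check that the checkpoint is really what dominates these rows, one can measure the size of the checkpoint entry in the attributes column directly. A sketch (assuming the JSONB attributes column of the current backends, with the checkpoint stored under the 'checkpoints' key):

-- size of the serialized checkpoint per node, largest first
SELECT id, node_type, octet_length(attributes ->> 'checkpoints') AS checkpoint_bytes
FROM db_dbnode
WHERE attributes ? 'checkpoints'
ORDER BY checkpoint_bytes DESC
LIMIT 10;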

Now, the problem is this:

    table_schema    |       table_name        | row_estimate |   total    |   index    |   toast    |   table
--------------------+-------------------------+--------------+------------+------------+------------+------------
 public             | db_dbnode               |  1.59477e+06 | 341 GB     | 8691 MB    | 332 GB     | 902 MB
 public             | db_dblink               |  4.68451e+06 | 1558 MB    | 1197 MB    | 8192 bytes | 360 MB
 public             | db_dblog                |       696285 | 1131 MB    | 256 MB     | 16 MB      | 860 MB
 public             | db_dbgroup_dbnodes      |       139167 | 22 MB      | 16 MB      |            | 6256 kB

The actual size of db_dbnode is not 2GB, it is 341 GB!

Some more info

VACUUM FULL ANALYZE reduced the total size of db_dbnode from 341GB to <3 GB.

My theory for what is happening here is that AiiDA keeps writing large amounts (MBs) of data to the checkpoints attribute of processes at every step. For some reason, postgresql is not able to "release" the previous data, so it just keeps accumulating on disk.

Unfortunately, the output of VACUUM FULL VERBOSE ANALYZE db_dbnode did not provide any indication of what data was being freed, so I cannot say with certainty that this theory is correct. However, this DB did contain workchains that ran particularly large numbers of steps (at least 6xN where N=30..80) and large checkpoints because they contained large StructureDataNodes (see example checkpoint above), so this seems plausible to me.

Overall, this leads me back to the question of whether storing checkpoints in the database (rather than the file repository / ...) is really the sensible thing to do here.

Mentioning @sphuber and @chrisjsewell for comment

P.S. If someone would like to check whether their database may be affected by this issue, I suggest you run the first of these queries on your database and check whether the toast size of db_dbnode is a lot larger than the table size (say, >10x).
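
In case that query is not at hand, something along the following lines (using the standard size functions, not necessarily the exact query linked above) should give the same total/index/toast/table breakdown for db_dbnode:

-- "toast" is estimated as total minus indexes minus the main table fork
SELECT pg_size_pretty(pg_total_relation_size('db_dbnode')) AS "total",
       pg_size_pretty(pg_indexes_size('db_dbnode')) AS "index",
       pg_size_pretty(pg_total_relation_size('db_dbnode') - pg_indexes_size('db_dbnode') - pg_relation_size('db_dbnode')) AS "toast",
       pg_size_pretty(pg_relation_size('db_dbnode')) AS "table";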

sphuber commented 3 years ago

A couple of things:

  1. Only one checkpoint is ever written per node. At each step it is overwritten, so previous ones are not kept.
  2. The checkpoint should be removed when the process terminates, so you should only ever have checkpoints in the database for active nodes.

These two things combined should keep data usage in the database acceptable, although I can see how certain use cases with a lot of concurrently active processes and large checkpoints can temporarily add a lot of data to the database. We could investigate what would happen to loading efficiency if we were to store the checkpoint in the repository instead.

ltalirz commented 3 years ago

Hi @sphuber , thanks for the quick reply.

  1. Only one checkpoint is ever written per node. At each step it is overwritten, so previous ones are not kept.

Right, I guess what I'm speculating here is that while this may be true for the state of the database as viewed from outside postgres, it may effectively not be true for the data that postgres stores on disk, since postgres for some reason is not able to automatically release the overwritten jsonb fields / rows / ...
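
One way to check this next time, before vacuuming, might be to look at PostgreSQL's own per-table statistics, which track dead tuples (just a sketch; on an affected database I would expect n_dead_tup to be very large):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'db_dbnode';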

The checkpoint should be removed when the process terminates, so you should only ever have checkpoints in the database for active nodes.

Ok, there were still active nodes in the DB, so those may be the ones that show up high in the list of largest db_dbnode rows.

sphuber commented 3 years ago

Right, if PostgreSQL indeed does not automatically vacuum the space on disk when the checkpoints are overwritten or removed, then that would be problematic. Since this happens so often when running AiiDA, it really should be cleaned up automatically. I personally have no idea why this could be happening, or even how we could confirm that the data being retained is due to the checkpoints stored in the attributes. Maybe @giovannipizzi has an idea how we could confirm or disprove this.

That being said, if we find that storing the checkpoint in the database is problematic, changing this to use the repository instead would be really easy: we just have to change the set_checkpoint method (and get_checkpoint and delete_checkpoint, of course). The usual caveat is that this would require all processes to be finished before the code is upgraded, or a migration would have to be added. However, even here I could foresee similar problems, especially now that we have the new object store, where object deletion is not direct: only the reference gets deleted and the actual space has to be freed with a manual operation. So with that solution we would actually end up in the situation that you are hypothesizing we are in now with storing the checkpoints in Postgres. So the solution may not be so trivial after all.

ltalirz commented 3 years ago

Yes, I was thinking the same.

Removing data from the database that isn't queried should be a good thing, but since the new file repository does not have a built-in autocleanup (as e.g. the file system would), it would not solve the "bloating" issue in production runs.

I did use this query to look at the DB before doing the VACUUM, but I did not notice any indication of an issue (actually, I originally thought this meant that a VACUUM FULL would not help either, and was surprised when it did).

 current_database                       | schemaname |     tablename      | tbloat | wastedbytes |                         iname                         | ibloat | wastedibytes
-------------------------------------------------------------+------------+--------------------+--------+-------------+-------------------------------------------------------+--------+--------------
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_pkey                                         |    0.0 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_loggername_00b5ba16                          |    0.1 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_loggername_00b5ba16_like                     |    0.1 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_levelname_ad5dc346                           |    0.0 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_levelname_ad5dc346_like                      |    0.0 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_uuid_9cf77df3_uniq                           |    0.0 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblog           |    1.2 |   130646016 | db_dblog_dbnode_id_da34b732                           |    0.0 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_user_id_12e7aeaf                            |    1.3 |    266534912
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_pkey                                        |    0.1 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_ctime_71626ef5                              |    0.8 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_type_a8ce9753                               |    1.5 |    367992832
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_type_a8ce9753_like                          |    1.5 |    383983616
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_uuid_62e0bf98_uniq                          |    0.1 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_process_type_df7298d0_like                  |    1.2 |    128049152
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_process_type_df7298d0                       |    1.2 |    126664704
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_mtime_0554ea3d                              |    1.1 |     48373760
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_label_6469539e                              |    0.9 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_label_6469539e_like                         |    0.9 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dbnode          |    1.1 |   107773952 | db_dbnode_dbcomputer_id_315372a3                      |    1.1 |     44826624
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblink          |    1.2 |    48791552 | db_dblink_pkey                                        |    0.5 |            0
 reference_set_mofs_daniele_9bdc01324561a830d82ce19453fbbf22 | public     | db_dblink          |    1.2 |    48791552 | db_dblink_label_f1343cfb                              |    1.0 |            0

P.S. For the VACUUM FULL I needed to kill all connections to the DB (stop the daemon workers, etc.), which was not needed for the VACUUM. I wonder whether it is possible that this backlog of data was somehow held back from being freed by one of these open connections...
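
If this happens again, something like the following could help spot such connections; as far as I understand, a regular VACUUM cannot reclaim dead tuples that are still visible to the oldest open transaction, so a long-lived "idle in transaction" session from a daemon worker would explain the backlog:

SELECT pid, usename, state, xact_start, query_start, left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY xact_start NULLS LAST;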

dev-zero commented 3 years ago

Do you know whether this can be reproduced with both backends (django and sqla)? Just wondering whether this might be an issue of transaction isolation, where PostgreSQL keeps the data around for other clients/connections to retrieve it in the previous state. The ORM backends might actually differ in the way they handle this. If this is indeed the problem, it might point to a deeper issue in our usage of the ORM, and it might be a good idea to revise how we control transactions and which guarantees w.r.t. consistency and isolation we actually need.

giovannipizzi commented 3 years ago

A few comments:

I think the concept of vacuuming is there in most DBs for performance reasons. We need to inform users that this should be done regularly for big DBs (maybe we should try to make this one of the tasks of the new repository management? I.e., when we do a similar operation for the repository, we also do a VACUUM FULL ANALYZE on the DB? Pinging @ramirezfranciscof and @chrisjsewell who are working on that). I don't think this is something we can avoid. (I think this should be the cause, but I'm not sure - it might also be that there was some transaction pending. If Leo manages to reproduce the issue and closes all connections, we could double check; but if I had to bet, I think it's because of the checkpoints.)

As mentioned earlier, moving those to the object store would not help - you have the same issue of vacuuming there as well (at least after you pack the loose objects), and in any case it does not seem to me the right place to store checkpoints; I see the repository only as a destination for permanent data.

And in general, even if this is not the root cause, if checkpoints can be so big, I would vote for moving them out from the DB.

By the way, this would also solve another important problem: if a user modifies e.g. the extras of a node while it is still running, either the user or the daemon could in principle get errors because concurrent modifications are being made to the same node. I didn't check how changing extras and attributes is done in the two ORM backends - maybe it's done properly, so that concurrent changes to both don't raise an exception when committing the transaction - but I see this as a potential problem (and I think we don't see it often only because people very rarely change extras while jobs are running, but it's something that shouldn't happen anyway).

Since, luckily, the checkpoint management is quite confined to a few places (set_checkpoint etc., and there is also a delete_checkpoint), here is a practical suggestion that would also not be complex to implement:

  1. in the "repository" folder, at the same level of the "container" folder, create a "checkpoints" folder (i.e. a sibling folder, so it's not interfering with the container)
  2. set_checkpoint will write a file in this folder, one per process (with the same serialisation as now). I envision this file to be called checkpoints/aa/bbccddee-ff... (i.e. use a 1-level sharding). Experience showed that a 2-level sharding does not really help; a 1-level sharding (with 2 characters) seems to be the best compromise.
    • Since the checkpoints should be deleted when the process is done, the number of files should never be too high (already 10'000 running processes is a lot, and with the sharding it's not going to stress the filesystem).
    • One can also implement a clean-up procedure to remove checkpoints of processes that are in a terminal state (e.g. together with the rest of the repository cleaning)
    • the advantage of having this folder in the repository is that there is no 'mix-up' between profiles etc.
  3. This should fix the issues above, probably be faster (both in storing, and in not hurting DB performance), and I don't see an immediate problem. Also, this might allow us (in the future, if wanted) to e.g. activate a mode to easily keep old checkpoints - even all versions - for debugging.

giovannipizzi commented 3 years ago

By the way, I checked one of my profiles where I recently ran a lot. In my case VACUUM FULL ANALYZE helped a lot in reducing the index size - it also reduced the toast size, but that wasn't the main issue in my case.

Before:


    table_schema    |       table_name        | row_estimate |   total    |   index    |   toast    |   table    
--------------------+-------------------------+--------------+------------+------------+------------+------------
 public             | db_dbnode               |       999019 | 6754 MB    | 5578 MB    | 635 MB     | 541 MB

After:

    table_schema    |       table_name        | row_estimate |   total    |   index    |   toast    |   table    
--------------------+-------------------------+--------------+------------+------------+------------+------------
 public             | db_dbnode               |       997595 | 1388 MB    | 615 MB     | 273 MB     | 500 MB

So, again, it would be beneficial to advise advanced users to vacuum the DB every now and then, and to bundle this together with the repository vacuuming.

However, VACUUM FULL and the disk-object-store's repack both need additional disk space to temporarily store a full copy. E.g. from the PostgreSQL docs:

This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete. Usually this should only be used when a significant amount of space needs to be reclaimed from within the table.

And the same applies to the disk-object-store. So for @chrisjsewell and @ramirezfranciscof it would be good to keep at least two levels of "vacuuming/repacking": one that is faster and can be run relatively often, and one "full/deep" that requires some extra disk space, requires stopping everything, and might take time - but is good to do every few months.
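
In PostgreSQL terms the two levels roughly already exist, e.g.:

-- "light" maintenance: runs alongside normal operation, marks dead-tuple space as reusable
VACUUM (VERBOSE, ANALYZE) db_dbnode;
-- "full/deep" maintenance: rewrites the table and returns space to the OS,
-- but takes an exclusive lock and temporarily needs disk space for the copy
VACUUM FULL ANALYZE db_dbnode;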

ltalirz commented 3 years ago

Just to briefly follow up as well (started writing before Gio's comments)

Hi @dev-zero , yes that is one possibility (fyi, this particular case was on a django backend). I think there are at least two, though:

  1. The sql client (AiiDA) is somehow preventing postgresql from freeing the data.
  2. The new data was written to the database too quickly for postgresql's built-in autovacuum schedule.

The VACUUM documentation isn't very explicit about what happens under the hood, but I read here that the regular VACUUM never gives back space to the operating system; it just marks the space occupied by dead tuples as reusable (and does some internal defragmentation of the remaining data).

It turns out that the default autovacuum settings include an autovacuum_vacuum_scale_factor of 0.2 (and a minimum threshold of 50, which is added on top), meaning that autovacuum kicks in only once 20% of the rows of a table (+50) have changed.

In the above example, this corresponds to 300k rows, i.e. it is perhaps possible that 300 GB of checkpoints were written in between autovacuums (at ~6 MB per checkpoint, as shown above, that would be 50k checkpoint writes).

If that is the problem, then a "quick fix" would be to substantially reduce the autovacuum_vacuum_scale_factor for the db_dbnode table, but in the medium term I agree that the database is likely not the best place to store the checkpoints.
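
For example, something like this (the numbers are purely illustrative) would make autovacuum consider db_dbnode once roughly 1% of its rows have changed, instead of the default 20%:

ALTER TABLE db_dbnode SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000);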

The nice thing about postgresql of course is that it does have a built-in autovacuum capability (even if the default settings may have failed us here), while this capability is still lacking in the new file repository implementation.