Postgres connection handling over restart of postgres server

ltalirz commented 4 years ago

The symptom of the problem is as follows: In Materials Cloud discover sections written using frameworks like bokeh / panel server + an AiiDA profile, a restart of the postgresql server results in QueryBuilder queries failing with

8, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.AdminShutdown) terminating connection due to administrator command
SSL connection has been closed unexpectedly

and also

File "base.py", line 223, in create_cursor:
cursor = self.connection.cursor() Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/base/base.py", line 235, in _cursor
    return self._prepare_cursor(self.create_cursor(name))
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 223, in create_cursor
    cursor = self.connection.cursor()
psycopg2.InterfaceError: connection already closed

This is not ideal since these sections may run on separate VMs etc., so it would be important to be able to restart the postgres server without running into such problems.

I'm not entirely sure how to proceed here - is the problem that AiiDA seems to be keeping a connection open at all times, or is this somehow intended behavior and we need to provide a way to recover from receiving the error with some try/except mechanism?

I can look into fixing this, just wanted to get some input from folks who worked on this more, e.g. from @giovannipizzi @CasperWA

sphuber commented 4 years ago

Have you tried running the following

from aiida.manage.manager import get_manager
get_manager().get_backend().get_session().close()

Note that this functionality was only added recently so you might need at least aiida-core==1.1.1

ltalirz commented 4 years ago

Thanks for the suggestion - I'm not quite sure I understand what you mean, so let me try to explain the use case better: you can think of the discover section essentially as a python script that runs indefinitely and, every now and then, uses the QueryBuilder (triggered users going on the web page).

Are you suggesting I should add get_manager().get_backend().get_session().close() after every use of the QueryBuilder (this could be in many places of the app)? Will this close the session in a "clean way" so that, the next time the QueryBuilder is called, it will automatically establish a new one?

sphuber commented 4 years ago

So you have a persistent python interpreter process running? How is it listening to requests? Is this going through the REST API?

The use case you are describing seems related to that of a web server and special caution needs to be taken in that case. As @CasperWA has also faced this problem with the optimade server, AiiDA is not intended to be run in a threaded way, but HTTP servers typically do. I think our own REST API, since it is using Flask, is also doing this. This was giving problems with the SqlAlchemy backend since the connections were not being cleaned up properly after a thread was joined. When running AiiDA in a threaded way, the user is responsible for closing SqlA sessions once the thread ends. I added the API for this that I posted in my original comment. The typical code for a REST API server then is to perform the query, get all the results and then close the session, before ending the thread by returning the results to the main thread. This ensures connections being returned to the pool.

In your case, I also think this will help against the postgres server being reset. By resetting the session after every query, the connection won't become stale.

The alternative, of building this into AiiDA and catching this condition, I am not sure exactly where it should be put. Nor exactly what kind of exceptions it should try to catch. But we might also need to think in this direction. I think the same problem may occur when just running AiiDA in production mode with the daemon running and Postgres is restarted. Probably the daemon runners will face the same exception. To be investigated...

ltalirz commented 4 years ago

So you have a persistent python interpreter process running? How is it listening to requests? Is this going through the REST API? The use case you are describing seems related to that of a web server

Yes. It is listening to requests via the bokeh server. It interacts with AiiDA directly via the python API (very similar to the optimade server).

I think the same problem may occur when just running AiiDA in production mode with the daemon running and Postgres is restarted. Probably the daemon runners will face the same exception. To be investigated...

Exactly, I was also thinking about that.

While threads might also play a role here, I think we need to consider what should happen if you have a single thread that is long-running and continues to interact with the database.

Should the connection to the DB remain always open?
And, if yes, should we provide a mechanism for recovering from an interrupted connection (I would say yes)

giovannipizzi commented 4 years ago

One of the two errors (AdminShutDown) maybe happens if there is a query running while rebooting? Maybe in that case we can't do much. Instead the second is just because the connection is recycled (which by default is ok I think, for efficiency - we shouldn't try to reopen the connection every time).

See here for a possible way to create an engine with a listener that reopens it: https://stackoverflow.com/questions/3033234/handle-mysql-restart-in-sqlalchemy

I think in general the idea is good and I would vote to go in the direction of using a listener on the engine. Be careful as it's a 10-year-old post, with various edits, so before implementing it's better to check in the most recent SQLAlchemy docs that this is still the intended way to go. Then, the hard task will be testing it robustly...

@ltalirz feel free to investigate in this direction, if you guys also feel this is a good way to go (BTW this is also important probably if the DB is on a different machine and the connection drops?)

CasperWA commented 3 years ago

I'll leave this here as a reference: PgBouncer. It seems like it could be a possible alternative to set this up and let it handle the idle connections.

sphuber commented 2 years ago

Just writing down a discussion I had last evening with @ltalirz about a similar problem that occurred.

We got a report from a user that submitted some processes to the daemon that then seemed to be stuck in the Created state. We figured out that in between the daemon having been started and the processes being submitted, the Postgres service had been restarted. After restarting the daemon, new processes could be submitted and started running without problems, however, the original stuck processes remained stuck.

When looking in the daemon logs, it was confirmed that at some point during the Process that the daemon worker was running, a connection to the database was opened to retrieve or store some data on the node, which failed because the server was unresponsive (probably during the restart). The exception most likely caused the daemon worker to acknowledge the task being finished and so the task was lost. Normally an exception during the lifetime of a process is reflected on the node so it is visible to the user through e.g. verdi, however, that obviously requires a database connection which was not available as that was the cause of the exception to begin with.

This example is similar to the example of the OP: there are a variety of long-lived AiiDA processes, REST API instance, daemon worker, interactive shell, that all will make calls to the database. Currently, there is no system in place to deal with lost connections. We probably haven't encountered these problems too much, since most users will have Postgres running on the same machine and so dropped connections will be unlikely. However, once AiiDA starts getting deployed in cloud configurations with the database on a different machine, these problems will become more likely and so we need a way to try and retry the connection and recover from connection losses.

This will already make a great step in robustness, but it will not help against persistent outages of the database service. In the case of a prolonged outage, we will still need a mechanism for the daemon workers to stop trying to connect to the database and notify the user. Note that we cannot use the pausing infrastructure written for calculation jobs, as this requires the database to function. The only thing I can see for now is that if a daemon worker fails to connect multiple times to the database, it refuses all new tasks and shuts itself down. We should make sure of course that the daemonizer will not automatically revive the worker in this case. We should then also find a way to communicate to the user that the database service is experiencing connection issues and should be looked into. It is not clear how best to do this. So far, the communication channels from worker to user is a) by proxying on nodes in the database b) through the log file. Option a) won't be available in this scenario and b) is very user-unfriendly as we cannot expect users to monitor the daemon logs.

ltalirz commented 2 years ago

Currently, there is no system in place to deal with lost connections.

To be a bit more precise here: as of today, the "optimistic disconnect handling" of sqlalchemy is actually in place and seems to be working:

the first time a DBAPIError is encountered when trying to execute a query, all connections in the pool are invalidated
the second time you try to connect, sqlalchemy tries to recreate the pool (and will succeed if the database is available by then).

You can test this as follows:

open a jupyter notebook / verdi shell
run an aiida calculation (just to make sure the db connection works)
restart the database
run an aiida calculation (fails)
run an aiida calculation again (works)

In the case of the workers as well as the REST API, each process will have their own connection pool, so one would expect to see this issue once per process.

What AiiDA is missing, however, is a mechanism to "retry" in such cases. In the case of the REST API, the request will fail. In the case of Processes, my understanding is that the rabbitmq task will be lost, which is somewhat more severe as processes are typically more complex / long-running than REST API requests.

edan-bainglass commented 1 week ago

@sphuber @ltalirz Has there been any resolution here? I'm interesting in this from the perspective of thread-safety in AiiDA, or at least a reasonable bypass (perhaps as suggested/implemented by @sphuber) for use in restAPI calls and/or DB queries as part of AiiDA-centric webapps.

sphuber commented 1 week ago

There hasn't been any work on this as far as I am aware. The problem described in the OP is not really related to threading. I misunderstood the problem in my initial response. The real problem here is that AiiDA should be able to automatically recover when connections to the database fail if the database is restarted for example. Currently an exception is raised that is not handled. For Processes that are running, this will result in them failing and not being picked up again, even when the database connection is restored.

unkcpz commented 1 week ago

@edan-bainglass for the thread-safe issue, you can have a look at https://github.com/aiidateam/aiida-core/issues/5765. The problem we saw in AiiDAlab is from the lifetime of node object is shorter than sqlalchemy session.

ltalirz commented 5 days ago

I think one underlying problem here is that the application I was talking about does not actually need a permanent connection/session, but currently does keep one in memory.

From first principles, the best approach would probably be to close the connection after a user interaction here.

Short of that, one could implement the try/except that @sphuber mentions, which can be done directly at the AiiDA level and would probably be useful there anyhow (independent of this "webapp" usage). Chatgpt says that in that case one would need to session.rollback() in the except, but I haven't looked into it.

aiidateam / aiida-core

Postgres connection handling over restart of postgres server #3867