If you are referring to `expire_on_commit`, then most likely the answer is no. That flag merely determines whether the objects held in SqlAlchemy's session cache are expired when the session is committed, meaning they will have to be refreshed from the database when next operated on. It has nothing to do with whether the connection is closed or not.
The source of the problem is indeed that idle worker processes (REST APIs, optimade APIs, discover sections... [1]) keep their database connection open, see writeup here.
I believe we now need to look into making sure that these web services close the postgresql connection when it is no longer needed.
One could think of offering a context manager replacement for `load_profile()` that takes care of closing the connection on exit (unloading the python modules is not needed). This would, however, require developers of AiiDA web applications to do explicit connection handling.
We should think about whether there is a way to get around this [2] and, if not, we need to pay great attention to making this as simple to use as possible.
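To make the idea concrete, here is a minimal sketch of such a context manager. The Django `connection.close()` call is Django's own API; how to dispose of the SQLAlchemy engine from aiida-core is left as an open question (there is no official API for that yet), so the cleanup below is an assumption, not a proposal of the actual implementation.

```python
from contextlib import contextmanager

from aiida import load_profile


@contextmanager
def profile_context(profile=None):
    """Load an AiiDA profile and try to release the database connection on exit (sketch only)."""
    try:
        yield load_profile(profile)
    finally:
        # For Django profiles, Django's own API can close its persistent connection
        # (it is reopened transparently if needed again); guarded so the sketch does
        # not assume a Django profile.
        try:
            from django.db import connection
            connection.close()
        except Exception:
            pass
        # For the SQLAlchemy engine used by the QueryBuilder, something like
        # ``engine.dispose()`` would additionally be needed; how to obtain the
        # engine from aiida-core is exactly the open question here.
```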
P.S. Issue https://github.com/aiidateam/aiida-core/issues/3867 is related but concerns the use case of persistent connections and essentially asks for a way to recover if those connections are interrupted due to external reasons.
[1] The same probably holds in the AiiDA lab, where every running jupyter notebook may have its own DB connection. Since these aren't threaded, however, one is unlikely to hit the 100 connection limit.
[2] If the goal is just to fix the issue encountered on materials cloud, one could of course also just add a cronjob on the db server that kills connections that have been idling for too long (+ a fix of https://github.com/aiidateam/aiida-core/issues/3867). That's more of a workaround than a solution, however.
mentioning also @giovannipizzi for info
P.S. Just another thought: As long as we manage to keep the AiiDA version of the databases hosted on Materials Cloud homogeneous, a model where the connection/profile is coupled to a particular request would actually allow us to serve all REST API requests from the same worker process (scaling to as many processes as needed to serve all requests). Information on which profile to connect to could be passed via a header or parsed from the request URL.
With the current model, we need to give each profile's REST API server enough resources to serve a reasonable number of requests, even if this profile is requested only very rarely (multiplying memory requirements [1] by the number of profiles that we are serving). The alternative model would lead to much more efficient use of RAM.
I believe the same holds for optimade requests (but not for discover sections, where the "server code" differs from one to the other).
[1] Memory usage by an apache wsgi daemon for the AiiDA REST API is roughly 100 MB; similar for a REST API served via a docker container running a gunicorn process.
Again here, thanks for the investigation of this @ltalirz !
A context manager similar to what is implemented in the REST API would be good, but may be difficult to implement. See https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/restapi/resources.py#L98 and the implementation at https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/restapi/common/utils.py#L853
However, a simple "close" functionality might also be good, if possible, since `load_profile` is usually called before any endpoint functionality is invoked, although it may not be necessary?
Concerning your latest comment: keeping a coupled connection/profile seems like a great solution, but at the same time quite limiting in terms of testing different AiiDA versions. If we do this in production only, it might work out great. The issue would then still persist on the development servers though - unless you would enforce AiiDA version limitations there as well?
Thanks for the pointers to the REST API code @CasperWA
So, it seems you are taking care of opening and closing sessions for the QueryBuilder, but the underlying connection is remaining untouched. I.e. we would need to add handling of the connection in a similar way.
~~One thing that puzzles me about this code is the comment "Close SQLA session after any method call" while, actually, method decorators are supposedly applied before the function call - perhaps you can comment. To me it looks like the session will remain open after the call, and then closed and opened again with the next call.~~ Ok, I've seen the implementation of the decorator now. Very clever!
The issue would then still persist on the development servers though - unless you would force AiiDA version limitations also there?
I agree that it is important to retain the ability to run REST APIs for different AiiDA versions, even if just for development purposes. This is definitely still possible with the approach I mentioned (you then just need one server per AiiDA version).
Thanks for the pointers to the REST API code @CasperWA
So, it seems you are taking care of opening and closing sessions for the QueryBuilder, but the underlying connection is remaining untouched. I.e. we would need to add handling of the connection in a similar way.
This was indeed my point: the code hinted at from the REST API was merely to show how your suggested context-manager solution might work, since we already have a solution that sort of does this. So it was meant as inspiration in the end :) But also to say that I am not sure dealing with the general PostgreSQL connection in the same way necessarily makes sense - I am unsure whether it would work or not. For now, the AiiDA profile is loaded upon starting the REST API server (the same is the case for the OPTIMADE REST API server), and it stays loaded throughout the lifetime of the server. Perhaps a solution would be a daemon that closes connections after X idle time and makes sure to open a new one upon requests? Or the loading of the profile could be moved to the request level, similar to the QueryBuilder SQLAlchemy sessions - this is the issue though, I'm not sure whether that can be done?
Ok, I've seen the implementation of the decorator now. Very clever!
Thanks :tada: it was a combined effort between @sphuber and me :)
This is definitely still possible with the approach I mentioned (you then just need one server per AiiDA version).
Hmm, is that feasible?
This forum post may be relevant https://forum.djangoproject.com/t/closing-old-db-connections-outside-of-the-request-context/299
This reminds me of a question as well @ltalirz: were the databases in your overview all using Django as the backend, or are some using SQLAlchemy? In other words, is this an issue specific to Django or a more general one?
I think this is a problem of configuration and not necessarily a problem of how aiida-core handles the connections. I already responded here https://github.com/aiidateam/aiida-optimade/issues/117, but I will copy the response here for ease of discussion:
Since you guys are mostly just using AiiDA to query the database and are not running the daemon, the only connections should go through the QueryBuilder. Even for Django, this goes through SqlAlchemy, and it is SqlAlchemy that manages a connection pool. The idea is then that we don't have to close the connections (which also has an overhead, and closing/opening every time may not be the most efficient) and simply reuse open ones when needed. I think this may ultimately simply be a configuration problem. Of course, if you start serving too many applications from a single server and PSQL cluster, at some point you run out of resources. If you think the current number of projects should perfectly well be manageable with the default of 100 connections, then we can always configure the connection pool of SqlAlchemy. @CasperWA added a commit not too long ago (the commit to test the correct handling of the session for the REST API) that allows configuring the parameters of the SQLA connection pool. I don't think it is fully plumbed through so that you can configure this dynamically per AiiDA profile, but if necessary you could add this. Otherwise you can try a temporary hardcoded solution and limit the number of connections in a pool. The default number of connections per pool seems to be 5.
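For reference, this is what those pool parameters look like at the plain SQLAlchemy level (not the aiida-core configuration mentioned above; the connection string is a placeholder):

```python
from sqlalchemy import create_engine

# At most `pool_size` connections are kept open in the pool; up to `max_overflow`
# extra connections may be opened temporarily under load and are closed again
# when returned to the pool.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    pool_size=5,      # SQLAlchemy default
    max_overflow=10,  # SQLAlchemy default
)
```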
thanks for the comment @sphuber , I'll look into exactly which connections are set up
Since you guys are mostly just using AiiDA to query the database and are not running the daemon, the only connections should go through the query builder.
Just to be sure: are you saying that if one is not running the daemon, one is not using the DB backend?
Perhaps my whole understanding of how the connection works is wrong, but what about `Int(5).store()`?
Just to be sure: are you saying that if one is not running the daemon, one is not using the DB backend? Perhaps my whole understanding of how the connection works is wrong, but what about `Int(5).store()`?
No, my point was that you then have only a single process running AiiDA and so a single ORM, which means there is a single SqlAlchemy engine with a single connection pool. The default is 5 connections, so I think each AiiDA process is already limited to 5 connections max. Although, now that I think of it, there is the overflow parameter, `max_overflow`, which is 10 by default I believe. So maybe, worst case, each AiiDA process can have up to 15 connections, but I think SQLA should automatically close the connections that are in the overflow. Those just serve to buffer sudden temporary spikes and should not remain open for longer periods of time.
I see, but what if it is using a django database?
To follow up: when running a REST API on a Django profile, after the first query two connections are opened (and remain open), as seen from the postgresql server.
I'll need to check whether these both come from sqla or whether one of them comes from Django.
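One way to check this from the postgresql side is `pg_stat_activity`, e.g. via psycopg2 (placeholder credentials, in the same style as the timing tests further down):

```python
import psycopg2

# Placeholder credentials; point this at the cluster serving the AiiDA profile.
conn = psycopg2.connect("dbname=... user=... password=...")
with conn.cursor() as cur:
    # application_name/state/query can help to tell the SQLAlchemy (QueryBuilder)
    # connections apart from the extra connection opened by Django.
    cur.execute(
        "SELECT pid, application_name, state, query "
        "FROM pg_stat_activity WHERE datname = current_database()"
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```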
I see, but what if it is using a django database?
That is the point: even for a Django backend, the database connection through the `QueryBuilder` still goes through SqlAlchemy, and so the same concepts apply. It is possible that in our setup the Django database connection is still opened; that I am not sure of, but we should verify this.
It is possible that in our setup the Django database connection is still opened, that I am not sure, but we should verify this.
And that is my point ;-) I'm checking now
See this issue also: https://github.com/aiidateam/aiida-core/issues/2039
I see, but what if it is using a django database?
That is the point: even for a Django backend, the database connection through the `QueryBuilder` still goes through SqlAlchemy, and so the same concepts apply. It is possible that in our setup the Django database connection is still opened; that I am not sure of, but we should verify this.
But this will not be the case if a Node is updated, like for the OPTIMADE REST API server upon the first request (updating the extras).
After forcing `pool_size=1` here:
https://github.com/aiidateam/aiida-core/blob/fe8333e92736644ee1248c626e9b30b8d0f6fa14/aiida/backends/utils.py#L40-L42
I.e. in case of a django profile, we will need to handle both the sqla connection and the django connection (which for the REST API can probably simply be closed once and for all if what Seb says is correct).
The idea is then that we don't have to close the connections (which also has an overhead and closing/opening everytime may not be the most efficient) and we simply reuse open ones when needed.
While this certainly makes sense for regular AiiDA usage, as the number of processes querying the DB increases, postgresql connections eventually become a scarce resource and one is better off closing them when they are not needed (see article). It's a balance to strike, of course, and we should check whether opening/closing a DB connection per request has significant performance impact on the REST API. However, we need to keep in mind that even if we just allow a single persistent connection per thread, this can easily explode into very many.
Edit: Things differ between the REST API and the verdi shell. On a Django backend I see this:
- `verdi shell` => 1 postgresql connection (same as opening `ipython`, loading the profile and then loading the backend)
- 1 QueryBuilder query => 1 additional postgresql connection
The idea is then that we don't have to close the connections (which also has an overhead and closing/opening everytime may not be the most efficient) and we simply reuse open ones when needed.
While this certainly makes sense for regular AiiDA usage, as the number of processes querying the DB increases, postgresql connections eventually become a scarce resource and one is better off closing them when they are not needed (see article). It's a balance to strike, of course, and we should check whether opening/closing a DB connection per request has significant performance impact on the REST API.
The ideal here, I think, would then be to figure out whether there can be a timer on connections. E.g., if one performs several requests through a REST API within a short amount of time, it would probably take longer if the connection had to be closed and reopened for each request. An appropriate lifetime for a connection (which MUST be reused by the REST API server in question) could be 30 min.?
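For what it's worth, SQLAlchemy has a related knob, `pool_recycle`, although it only replaces connections older than the given age when they are next checked out rather than proactively closing idle ones, so it only approximates such a timer (plain SQLAlchemy sketch, placeholder DSN):

```python
from sqlalchemy import create_engine

# Connections older than 30 minutes are closed and replaced the next time they
# are checked out from the pool; idle connections are not closed proactively.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    pool_recycle=1800,  # seconds
)
```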
However, we need to keep in mind that even if we just allow a single persistent connection per thread, this can easily explode into very many.
Indeed. However, what we have now is then worse than this, with 2 connections per thread no matter the backend, or did I misunderstand?
Just as a follow-up to @giovannipizzi's question concerning where time is being spent when loading the backend, I did some timing tests.
ipython
In [1]: from aiida import load_profile
In [2]: load_profile()
Out[2]: <aiida.manage.configuration.profile.Profile at 0x7facb9d35978>
In [3]: from aiida.manage.manager import get_manager
In [4]: m=get_manager()
In [5]: %prun -D backend2.prof m.get_backend()
This takes ~1.5 s on my machine, with 94% of the time spent inside `importlib._bootstrap` (function `exec_module`).
Then I did:
python -m timeit -n 1 -r 1 -s "import psycopg2" 'conn= psycopg2.connect("dbname=... user=... password=...")'
1 loops, best of 1: 3.05 msec per loop
python -m timeit -n 1 -r 10 -s "import psycopg2" 'conn= psycopg2.connect("dbname=... user=... password=...")'
1 loops, best of 10: 2.44 msec per loop
Note: I also checked manually that this really establishes the connection, and calling `%prun` on an individual `connect` statement also returns ~3 ms.
The same measurements on the Materials Cloud server (slow file system, postgresql host running on a different machine) give:
In [1]: from aiida import load_profile
In [2]: load_profile()
Out[2]: <aiida.manage.configuration.profile.Profile at 0x7ff50f214ba8>
In [3]: from aiida.manage.manager import get_manager
In [4]: m=get_manager()
In [5]: %prun -D backend.prof m.get_backend()
This takes ~5.4s.
python -m timeit -n 1 -r 1 -s "import psycopg2" 'conn= psycopg2.connect("host=... dbname=... user=... password=...")'
1 loops, best of 1: 10.6 msec per loop
python -m timeit -n 1 -r 10 -s "import psycopg2" 'conn= psycopg2.connect("host=... dbname=... user=... password=...")'
1 loops, best of 10: 6.28 msec per loop
While opening a new postgresql connection is not instantaneous (~3-10ms), I believe it would be an acceptable delay on a per-request basis. And it is dwarfed by the time required to import the backend python modules (factor ~500 on an SSD with postgresql host on the same machine; similar factor on the materials cloud server with a slow file system and postgresql host running on a different machine).
Note: There will of course also be python code involved when considering profile switching on a per-request basis; one will need to figure out how expensive this is on the python side.
Just to add some further testing notes (current `develop`):
It seems to me that the number of connections opened on a standard django profile depends a lot on the initial load on the server.
E.g. when I start a new `verdi restapi`, open a URL that requires database access (e.g. http://127.0.0.1:5000/api/v4/groups) and hit CMD+R like crazy, I can get up to 7 or more connections on the postgresql server side (whereas, if I do it slowly, the number of connections stays at 2).
If I then stop doing anything, the connections remain open.
If I then continue requests to the page, the number of open connections either stays constant or can even go down.
Given the `close_connection` decorator, this behavior is somewhat unexpected.
It seems to me that there are scenarios under which new connections are not closed as they should be. Will look into this further.
Wouldn't it make sense, though, that each new request opens a new connection (SQLA session), which it then closes? Or is this irrelevant/separate from the number of connections you're seeing? This always confuses me slightly, especially when mixing Django into all of this...
Wouldn't it make sense, though, that each new request opens a new connection (SQLA session), which it then closes?
Yes!
I should probably have been clearer that the numbers of postgresql connections I was observing were measured in between queries. E.g. when I say there are 7 open connections, this is "without any active request".
Before I look into trying to get rid of superfluous Django connections, here is a summary of the observations for `sqlalchemy` profiles, as a reference to compare against.
On `sqlalchemy` profiles, the number of connections on the postgresql side correctly reflects the settings of the `QueuePool`. By switching to a `NullPool`, the number of permanent connections drops to zero.
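For reference, switching to a `NullPool` at the plain SQLAlchemy level looks like this (how it is wired into aiida-core is a separate matter; placeholder DSN):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# With NullPool, no connections are kept around: each checkout opens a fresh
# postgresql connection, which is closed again when returned, so no permanent
# connections remain between requests.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    poolclass=NullPool,
)
```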
I noticed that the first request to, e.g., `api/v4/nodes` requires at least 2 concurrent connections (it hangs with `pool_size=1` and `max_overflow=0`) and results in connections being checked out from the queue pool and put back 5 times [1]. Subsequent requests only require one such checkout.
Here are some timings of requests to the http://127.0.0.1:5000/api/v4/nodes endpoint on a tiny database, using the built-in server on my macbook (local postgres DB):
- `QueuePool` with 5 connections
- `NullPool`
This is consistent with the overhead of opening a postgres connection mentioned above and would indicate that switching to a `NullPool` would incur an acceptable performance hit (~5% for a cheap query; less for more expensive queries), while getting rid of both the issue with the large number of open connections and the issue of stale connections after database restarts (https://github.com/aiidateam/aiida-core/issues/3867).
Tests were done on my macbook, but I don't expect a significant difference on the mcloud servers.
[1] To see this, set `verdi config logging.sqlalchemy_loglevel DEBUG`, run `verdi restapi 2> sqla_prof.log` and `grep -i queuepool sqla_prof.log`.
For django profiles, one has an additional persistent connection that is initialized together with the dbenv.
One simple workaround is to close the connection in the `close_session` decorator via `from django.db import connection; connection.close()`.
I've added this and the `NullPool` in https://github.com/aiidateam/aiida-core/pull/4741 - let's see whether the tests pass.
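For illustration, a minimal sketch of how that workaround could sit inside such a decorator; the name `close_session` follows the REST API code referenced earlier, only the Django part is shown here (the real decorator also handles the SQLAlchemy session), and the exact integration in the PR may differ:

```python
from functools import wraps

from django.db import connection


def close_session(wrapped):
    """Sketch: after the wrapped REST API method has run, also close the Django DB connection."""

    @wraps(wrapped)
    def wrapper(*args, **kwargs):
        try:
            return wrapped(*args, **kwargs)
        finally:
            # Close the persistent connection Django opened together with the dbenv;
            # Django reopens it transparently the next time it is needed.
            connection.close()

    return wrapper
```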
Another interesting observation from the sqlalchemy logs: even the first request to a Django profile results in only one connection being checked out from the sqla `QueuePool` and put back once. I.e. that is all that's needed for the QueryBuilder to work.
This suggests that the 4 extra connection checkouts observed on sqlalchemy profiles (on the first request) are caused by some initialization of the sqla database (e.g. I did notice `alembic` queries in the logs) that could be skipped entirely when running the REST API.
Since Django is dropped for v2.0, I am closing this. Feel free to reopen if it persists in v2.0 nonetheless, or if this is a critical problem that needs to be fixed on `v1.6.x`.
While using the OPTIMADE Client I found an issue with the Materials Cloud AiiDA-OPTIMADE servers, which turned out to be caused by 100+ PostgreSQL database connections not being closed (see related issue).
I was looking into the PR I created to handle closing SQLAlchemy sessions for the AiiDA REST API (#3974), and came across the following asymmetry between the backends in the lines linked below, which might be fine, but might not be.
Django:
https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/backends/djsite/__init__.py#L65
SQLAlchemy:
https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/backends/sqlalchemy/__init__.py#L57
Could this be the cause of the issue? The database that caused the issue was indeed using a django backend.
Pinging @sphuber and @giovannipizzi