If you are referring to `expire_on_commit`, then most likely the answer is no. That flag merely determines whether the objects held in SqlAlchemy's session cache are expired when the session is committed, meaning they will have to be refreshed from the database when next operated on. It has nothing to do with whether the connection is closed or not.
The source of the problem is indeed that idle worker processes (REST APIs, optimade APIs, discover sections... [1]) keep their database connection open, see writeup here.
I believe we now need to look into making sure that these web services close the postgresql connection when it is no longer needed.
One could think of offering a context manager replacement for `load_profile()` that takes care of closing the connection on exit (unloading the python modules is not needed). This would, however, require developers of AiiDA web applications to do explicit connection handling.
We should think about whether there is a way to get around this [2] and, if not, we need to pay great attention to making this as simple to use as possible.
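To make the idea concrete, here is a minimal sketch of such a context manager. The Django `connection.close()` call is Django's own API; how to dispose of the SQLAlchemy engine from aiida-core is left as an open question (there is no official API for that yet), so the cleanup below is an assumption, not a proposal of the actual implementation.

```python
from contextlib import contextmanager

from aiida import load_profile


@contextmanager
def profile_context(profile=None):
    """Load an AiiDA profile and try to release the database connection on exit (sketch only)."""
    try:
        yield load_profile(profile)
    finally:
        # For Django profiles, Django's own API can close its persistent connection
        # (it is reopened transparently if needed again); guarded so the sketch does
        # not assume a Django profile.
        try:
            from django.db import connection
            connection.close()
        except Exception:
            pass
        # For the SQLAlchemy engine used by the QueryBuilder, something like
        # ``engine.dispose()`` would additionally be needed; how to obtain the
        # engine from aiida-core is exactly the open question here.
```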
P.S. Issue https://github.com/aiidateam/aiida-core/issues/3867 is related but concerns the use case of persistent connections and essentially asks for a way to recover if those connections are interrupted due to external reasons.
[1] The same probably holds in the AiiDA lab, where every running jupyter notebook may have its own DB connection. Since these aren't threaded, however, one is unlikely to hit the 100 connection limit.
[2] If the goal is just to fix the issue encountered on materials cloud, one could of course also just add a cronjob on the db server that kills connections that have been idling for too long (+ a fix of https://github.com/aiidateam/aiida-core/issues/3867). That's more of a workaround than a solution, however.
mentioning also @giovannipizzi for info
P.S. Just another thought: As long as we manage to keep the AiiDA version of the databases hosted on Materials Cloud homogeneous, a model where the connection/profile is coupled to a particular request would actually allow us to serve all REST API requests from the same worker process (scaling to as many processes as needed to serve all requests). Information on which profile to connect to could be passed via a header or parsed from the request URL.
With the current model, we need to give each profile's REST API server enough resources to serve a reasonable number of requests, even if this profile is requested only very rarely (multiplying memory requirements [1] by the number of profiles that we are serving). The alternative model would lead to much more efficient use of RAM.
I believe the same holds for optimade requests (but not for discover sections, where the "server code" differs from one to the other).
[1] Memory usage by an apache wsgi daemon for the AiiDA REST API is roughly 100 MB; similar for a REST API served via a docker container running a gunicorn process.
Again here, thanks for the investigation of this @ltalirz !
A context manager similar to what is implemented in the REST API would be good, but may be difficult to implement. See https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/restapi/resources.py#L98 and the implementation at https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/restapi/common/utils.py#L853
However, a simple "close" functionality might also be good, if possible, since `load_profile` is usually called before any endpoint functionality is invoked, although it may not be necessary?
Concerning your latest comment: keeping a coupled connection/profile seems like a great solution, but at the same time quite limiting in terms of testing different AiiDA versions. If we do this in production only, it might work out great. The issue would then still persist on the development servers though - unless you would enforce AiiDA version limitations there as well?
Thanks for the pointers to the REST API code @CasperWA
So, it seems you are taking care of opening and closing sessions for the QueryBuilder, but the underlying connection is remaining untouched. I.e. we would need to add handling of the connection in a similar way.
~~One thing that puzzles me about this code is the comment "Close SQLA session after any method call" while, actually, method decorators are supposedly applied before the function call - perhaps you can comment. To me it looks like the session will remain open after the call, and then closed and opened again with the next call.~~ Ok, I've seen the implementation of the decorator now. Very clever!
The issue would then still persist on the development servers though - unless you would force AiiDA version limitations also there?
I agree that it is important to retain the ability to run REST APIs for different AiiDA versions, even if just for development purposes. This is definitely still possible with the approach I mentioned (you then just need one server per AiiDA version).
Thanks for the pointers to the REST API code @CasperWA
So, it seems you are taking care of opening and closing sessions for the QueryBuilder, but the underlying connection is remaining untouched. I.e. we would need to add handling of the connection in a similar way.
This was indeed my point: the code hinted at from the REST API was merely to show how your suggested context-manager solution might work, since we already have a solution that sort of does this. So it was meant as inspiration in the end :) But also to say that I am not sure dealing with the general PostgreSQL connection in the same way necessarily makes sense - I am unsure whether it would work or not. For now, the AiiDA profile is loaded upon starting the REST API server (the same is the case for the OPTIMADE REST API server), and it stays loaded throughout the lifetime of the server. Perhaps a solution would be a daemon that closes connections after X idle time and makes sure to open a new one upon requests? Or the loading of the profile could be moved to the request level, similar to the QueryBuilder SQLAlchemy sessions - this is the issue though, I'm not sure whether that can be done?
Ok, I've seen the implementation of the decorator now. Very clever!
Thanks :tada: it was a combined effort between @sphuber and me :)
This is definitely still possible with the approach I mentioned (you then just need one server per AiiDA version).
Hmm, is that feasible?
This forum post may be relevant https://forum.djangoproject.com/t/closing-old-db-connections-outside-of-the-request-context/299
This reminds me of a question as well @ltalirz: were the databases in your overview all using Django as the backend, or are some using SQLAlchemy? In other words, is this an issue specific to Django or a more general one?
I think this is a problem of configuration and not necessarily a problem of how aiida-core handles the connections. I already responded here https://github.com/aiidateam/aiida-optimade/issues/117, but I will copy the response here for ease of discussion:
Since you guys are mostly just using AiiDA to query the database and are not running the daemon, the only connections should go through the QueryBuilder. Even for Django, this goes through SqlAlchemy, and it is SqlAlchemy that manages a connection pool. The idea is then that we don't have to close the connections (which also has an overhead, and closing/opening every time may not be the most efficient) and simply reuse open ones when needed. I think this may ultimately simply be a configuration problem. Of course, if you start serving too many applications from a single server and PSQL cluster, at some point you run out of resources. If you think the current number of projects should perfectly well be manageable with the default of 100 connections, then we can always configure the connection pool of SqlAlchemy. @CasperWA added a commit not too long ago (the commit to test the correct handling of the session for the REST API) that allows configuring the parameters of the SQLA connection pool. I don't think it is fully plumbed through so that you can configure this dynamically per AiiDA profile, but if necessary you could add this. Otherwise you can try a temporary hardcoded solution and limit the number of connections in a pool. The default number of connections per pool seems to be 5.
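For reference, this is what those pool parameters look like at the plain SQLAlchemy level (not the aiida-core configuration mentioned above; the connection string is a placeholder):

```python
from sqlalchemy import create_engine

# At most `pool_size` connections are kept open in the pool; up to `max_overflow`
# extra connections may be opened temporarily under load and are closed again
# when returned to the pool.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    pool_size=5,      # SQLAlchemy default
    max_overflow=10,  # SQLAlchemy default
)
```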
thanks for the comment @sphuber , I'll look into exactly which connections are set up
Since you guys are mostly just using AiiDA to query the database and are not running the daemon, the only connections should go through the query builder.
Just to be sure: are you saying that if one is not running the daemon, one is not using the DB backend?
Perhaps my whole understanding of how the connection works is wrong, but what about `Int(5).store()`?
Just to be sure: are you saying that if one is not running the daemon, one is not using the DB backend? Perhaps my whole understanding of how the connection works is wrong, but what about `Int(5).store()`?
No, my point was that you then have only a single process running AiiDA and so a single ORM, which means there is a single SqlAlchemy engine with a single connection pool. The default is 5 connections, so I think each AiiDA process is already limited to 5 connections max. Although, now that I think of it, there is the overflow parameter, `max_overflow`, which is 10 by default I believe. So maybe, worst case, each AiiDA process can have up to 15 connections, but I think SQLA should automatically close the connections that are in the overflow. Those just serve to buffer sudden temporary spikes and should not remain open for longer periods of time.
I see, but what if it is using a django database?
To follow up: when running a REST API on a Django profile, after the first query two connections are opened (and remain open), as seen from the postgresql server.
I'll need to check whether these both come from sqla or whether one of them comes from Django.
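One way to check this from the postgresql side is `pg_stat_activity`, e.g. via psycopg2 (placeholder credentials, in the same style as the timing tests further down):

```python
import psycopg2

# Placeholder credentials; point this at the cluster serving the AiiDA profile.
conn = psycopg2.connect("dbname=... user=... password=...")
with conn.cursor() as cur:
    # application_name/state/query can help to tell the SQLAlchemy (QueryBuilder)
    # connections apart from the extra connection opened by Django.
    cur.execute(
        "SELECT pid, application_name, state, query "
        "FROM pg_stat_activity WHERE datname = current_database()"
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```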
I see, but what if it is using a django database?
That is the point: even for a Django backend, the database connection through the `QueryBuilder` still goes through SqlAlchemy, and so the same concepts apply. It is possible that in our setup the Django database connection is still opened; that I am not sure of, but we should verify this.
It is possible that in our setup the Django database connection is still opened, that I am not sure, but we should verify this.
And that is my point ;-) I'm checking now
See this issue also: https://github.com/aiidateam/aiida-core/issues/2039
I see, but what if it is using a django database?
That is the point: even for a Django backend, the database connection through the `QueryBuilder` still goes through SqlAlchemy, and so the same concepts apply. It is possible that in our setup the Django database connection is still opened; that I am not sure of, but we should verify this.
But this will not be the case if a Node is updated, like for the OPTIMADE REST API server upon the first request (updating the extras).
After forcing `pool_size=1` here:
https://github.com/aiidateam/aiida-core/blob/fe8333e92736644ee1248c626e9b30b8d0f6fa14/aiida/backends/utils.py#L40-L42
I.e. in case of a django profile, we will need to handle both the sqla connection and the django connection (which for the REST API can probably simply be closed once and for all if what Seb says is correct).
The idea is then that we don't have to close the connections (which also has an overhead and closing/opening everytime may not be the most efficient) and we simply reuse open ones when needed.
While this certainly makes sense for regular AiiDA usage, as the number of processes querying the DB increases, postgresql connections eventually become a scarce resource and one is better off closing them when they are not needed (see article). It's a balance to strike, of course, and we should check whether opening/closing a DB connection per request has significant performance impact on the REST API. However, we need to keep in mind that even if we just allow a single persistent connection per thread, this can easily explode into very many.
Edit: Things differ between the REST API and the verdi shell. On a Django backend I see this:
- `verdi shell` => 1 postgresql connection (same as opening `ipython`, loading the profile and then loading the backend)
- 1 QueryBuilder query => 1 additional postgresql connection
The idea is then that we don't have to close the connections (which also has an overhead and closing/opening everytime may not be the most efficient) and we simply reuse open ones when needed.
While this certainly makes sense for regular AiiDA usage, as the number of processes querying the DB increases, postgresql connections eventually become a scarce resource and one is better off closing them when they are not needed (see article). It's a balance to strike, of course, and we should check whether opening/closing a DB connection per request has significant performance impact on the REST API.
The ideal here, I think, would then be to figure out whether there can be a timer on connections. E.g., if one performs several requests through a REST API within a short amount of time, it would probably take longer if the connection had to be closed and reopened for each request. An appropriate lifetime for a connection (which MUST be reused by the REST API server in question) could be 30 min.?
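For what it's worth, SQLAlchemy has a related knob, `pool_recycle`, although it only replaces connections older than the given age when they are next checked out rather than proactively closing idle ones, so it only approximates such a timer (plain SQLAlchemy sketch, placeholder DSN):

```python
from sqlalchemy import create_engine

# Connections older than 30 minutes are closed and replaced the next time they
# are checked out from the pool; idle connections are not closed proactively.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    pool_recycle=1800,  # seconds
)
```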
However, we need to keep in mind that even if we just allow a single persistent connection per thread, this can easily explode into very many.
Indeed. However, what we have now is then worse than this, with 2 connections per thread no matter the backend, or did I misunderstand?
Just as a follow-up to @giovannipizzi's question concerning where time is being spent when loading the backend, I did some timing tests.
ipython
In [1]: from aiida import load_profile
In [2]: load_profile()
Out[2]: <aiida.manage.configuration.profile.Profile at 0x7facb9d35978>
In [3]: from aiida.manage.manager import get_manager
In [4]: m=get_manager()
In [5]: %prun -D backend2.prof m.get_backend()
This takes ~1.5 s on my machine, with 94% of the time spent inside `importlib._bootstrap` (function `exec_module`).
Then I did:
python -m timeit -n 1 -r 1 -s "import psycopg2" 'conn= psycopg2.connect("dbname=... user=... password=...")'
1 loops, best of 1: 3.05 msec per loop
python -m timeit -n 1 -r 10 -s "import psycopg2" 'conn= psycopg2.connect("dbname=... user=... password=...")'
1 loops, best of 10: 2.44 msec per loop
Note: I also checked manually that this really establishes the connection, and calling `%prun` on an individual `connect` statement also returns ~3 ms.
The same measurements on the Materials Cloud server (slow file system, postgresql host running on a different machine) give:
In [1]: from aiida import load_profile
In [2]: load_profile()
Out[2]: <aiida.manage.configuration.profile.Profile at 0x7ff50f214ba8>
In [3]: from aiida.manage.manager import get_manager
In [4]: m=get_manager()
In [5]: %prun -D backend.prof m.get_backend()
This takes ~5.4s.
python -m timeit -n 1 -r 1 -s "import psycopg2" 'conn= psycopg2.connect("host=... dbname=... user=... password=...")'
1 loops, best of 1: 10.6 msec per loop
python -m timeit -n 1 -r 10 -s "import psycopg2" 'conn= psycopg2.connect("host=... dbname=... user=... password=...")'
1 loops, best of 10: 6.28 msec per loop
While opening a new postgresql connection is not instantaneous (~3-10ms), I believe it would be an acceptable delay on a per-request basis. And it is dwarfed by the time required to import the backend python modules (factor ~500 on an SSD with postgresql host on the same machine; similar factor on the materials cloud server with a slow file system and postgresql host running on a different machine).
Note: There will of course also be python code involved when considering profile switching on a per-request basis; one will need to figure out how expensive this is on the python side.
Just to add some further testing notes (current `develop`):
It seems to me that the number of connections opened on a standard django profile depends a lot on the initial load on the server.
E.g. when I start a new `verdi restapi`, open a URL that requires database access (e.g. http://127.0.0.1:5000/api/v4/groups) and hit CMD+R like crazy, I can get up to 7 or more connections on the postgresql server side (whereas, if I do it slowly, the number of connections stays at 2).
If I then stop doing anything, the connections remain open.
If I then continue requests to the page, the number of open connections either stays constant or can even go down.
Given the `close_connection` decorator, this behavior is somewhat unexpected.
It seems to me that there are scenarios under which new connections are not closed as they should be. Will look into this further.
Wouldn't it make sense, though, that each new request opens a new connection (SQLA session), which it then closes? Or is this irrelevant/separate from the number of connections you're seeing? This always confuses me slightly, especially when mixing Django into all of this...
Wouldn't it make sense, though, that each new request opens a new connection (SQLA session), which it then closes?
Yes!
I should probably have been clearer that the numbers of postgresql connections I was observing were measured in between queries. E.g. when I say there are 7 open connections, this is "without any active request".
Before I look into trying to get rid of superfluous Django connections, here is a summary of the observations for `sqlalchemy` profiles, as a reference to compare against.
On `sqlalchemy` profiles, the number of connections on the postgresql side correctly reflects the settings of the `QueuePool`. By switching to a `NullPool`, the number of permanent connections drops to zero.
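For reference, switching to a `NullPool` at the plain SQLAlchemy level looks like this (how it is wired into aiida-core is a separate matter; placeholder DSN):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# With NullPool, no connections are kept around: each checkout opens a fresh
# postgresql connection, which is closed again when returned, so no permanent
# connections remain between requests.
engine = create_engine(
    "postgresql://aiida:password@localhost:5432/aiida_db",  # placeholder DSN
    poolclass=NullPool,
)
```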
I noticed that the first request to, e.g., `api/v4/nodes` requires at least 2 concurrent connections (it hangs with `pool_size=1` and `max_overflow=0`) and results in connections being checked out from the queue pool and put back 5 times [1]. Subsequent requests only require one such checkout.
Here are some timings of requests to the http://127.0.0.1:5000/api/v4/nodes endpoint on a tiny database, using the built-in server on my macbook (local postgres DB):
- `QueuePool` with 5 connections
- `NullPool`
This is consistent with the overhead of opening a postgres connection mentioned above and would indicate that switching to a `NullPool` would incur an acceptable performance hit (~5% for a cheap query; less for more expensive queries), while getting rid of both the issue with the large number of open connections and the issue of stale connections after database restarts (https://github.com/aiidateam/aiida-core/issues/3867).
Tests were done on my macbook, but I don't expect a significant difference on the mcloud servers.
[1] To see this, set `verdi config logging.sqlalchemy_loglevel DEBUG`, run `verdi restapi 2> sqla_prof.log` and `grep -i queuepool sqla_prof.log`.
For django profiles, one has an additional persistent connection that is initialized together with the dbenv.
One simple workaround is to close the connection in the `close_session` decorator via `from django.db import connection; connection.close()`.
I've added this and the `NullPool` in https://github.com/aiidateam/aiida-core/pull/4741 - let's see whether the tests pass.
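For illustration, a minimal sketch of how that workaround could sit inside such a decorator; the name `close_session` follows the REST API code referenced earlier, only the Django part is shown here (the real decorator also handles the SQLAlchemy session), and the exact integration in the PR may differ:

```python
from functools import wraps

from django.db import connection


def close_session(wrapped):
    """Sketch: after the wrapped REST API method has run, also close the Django DB connection."""

    @wraps(wrapped)
    def wrapper(*args, **kwargs):
        try:
            return wrapped(*args, **kwargs)
        finally:
            # Close the persistent connection Django opened together with the dbenv;
            # Django reopens it transparently the next time it is needed.
            connection.close()

    return wrapper
```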
Another interesting observation from the sqlalchemy logs: even the first request to a Django profile results in only one connection being checked out from the sqla `QueuePool` and put back once. I.e. that is all that's needed for the QueryBuilder to work.
This suggests that the 4 extra connection checkouts observed on sqlalchemy profiles (on the first request) are caused by some initialization of the sqla database (e.g. I did notice `alembic` queries in the logs) that could be skipped entirely when running the REST API.
Since Django is dropped for v2.0, I am closing this. Feel free to reopen if it persists in v2.0 nonetheless, or if this is a critical problem that needs to be fixed on `v1.6.x`.
While using the OPTIMADE Client I found an issue with the Materials Cloud AiiDA-OPTIMADE servers, which turned out to be caused by 100+ PostgreSQL database connections not being closed (see related issue).
I was looking into the PR I created to handle closing SQLAlchemy sessions for the AiiDA REST API (#3974), and came across the following asymmetry between the backends in the lines linked below, which might be fine, but might not be.
Django:
https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/backends/djsite/__init__.py#L65
SQLAlchemy:
https://github.com/aiidateam/aiida-core/blob/b002a9a9a898a05338ff522b46686657fb468f66/aiida/backends/sqlalchemy/__init__.py#L57
Could this be the cause of the issue? The database that caused the issue was indeed using a django backend.
Pinging @sphuber and @giovannipizzi