django / channels

Developer-friendly asynchrony for Django
https://channels.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Idle database connections not reaped #871

Closed andrewgodwin closed 6 years ago

andrewgodwin commented 6 years ago

Initially reported in https://github.com/django/daphne/issues/163

I suspect this is due to close_old_connections not being called, as it was in Channels 1 here: https://github.com/django/channels/blob/1.x/channels/signals.py#L10

andrewgodwin commented 6 years ago

@michaelhays If you want to verify this would fix it, you can try calling close_old_connections yourself at the end of your consumer's method (but likely not in disconnect, as https://github.com/django/daphne/issues/152 will prevent that). I'll look at solving both issues later in the week.

michaelhays commented 6 years ago

close_old_connections didn't work for me as I used it. I tried it in a couple of places.

End of the consumer method:

# consumers.py
from channels.generic.websocket import JsonWebsocketConsumer
from django.db import close_old_connections

from .models import Chat

class ChatbotConsumer(JsonWebsocketConsumer):
    def connect(self):
        self.accept()

    def receive_json(self, content):
        # create Chat in scope if one doesn't exist
        if 'chat' not in self.scope:
            chat = Chat.objects.create_chat(self.scope['user'])
            self.scope['chat'] = chat

        # process chatbot response, store messages, and send response to client
        response_data = some_func(content['message'])
        self.scope['chat'].add_message(content['message'])
        self.scope['chat'].add_message(response_data)
        self.send_json(response_data)
        close_old_connections()

End of the model method:

# models.py
from django.contrib.postgres.fields import JSONField
from django.db import close_old_connections, models

class Chat(models.Model):
    body = JSONField()
    ...

    def add_message(self, message):
        self.body.append(message)
        self.save()
        close_old_connections()

Setting CONN_MAX_AGE to 0 has helped a bit, in that certain actions like reloading the page remove 2 or 3 of the "COMMIT" idle connections (but only sometimes?), and occasionally the idle connections are reaped. But I've yet to notice any rhyme or reason to it.
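
For anyone following along, a sketch of where that setting lives (engine and database name are placeholders for whatever the project uses):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',     # placeholder
        'CONN_MAX_AGE': 0,  # 0 = close connections at the end of each request
    }
}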

andrewgodwin commented 6 years ago

Probably need the startup method connected as well. I'll look into it more tomorrow when I have time to work on Channels!

andrewgodwin commented 6 years ago

OK, I have replicated this, and the temporary workaround is calling django.db.close_old_connections right after you do an ORM call, inside the same sync context (calling it anywhere else hides the connections due to thread safety).
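
A minimal sketch of that placement (Message is a hypothetical model); the point is that the cleanup call runs in the same thread that opened the connection:

# Somewhere called from your consumer, in a sync context:
from django.db import close_old_connections

def get_messages(chat_id):
    messages = list(Message.objects.filter(chat_id=chat_id))  # ORM call
    close_old_connections()  # same thread, so it can see this connection;
                             # called from another thread it could not
    return messages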

Working on a proper fix now.

andrewgodwin commented 6 years ago

OK, this should be fixed in 6afd423 - any code inside of a SyncConsumer will work fine, and anything you're calling from an async context should use the new channels.db.database_sync_to_async (can either be called with a function or be used as a function/method decorator). You'll also need the latest release of asgiref (2.1.4) to make method decorators work correctly.
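
To illustrate both usage styles, a sketch (the get_chat helper and Chat model are placeholders):

from channels.consumer import AsyncConsumer
from channels.db import database_sync_to_async

@database_sync_to_async
def get_chat(pk):  # used as a decorator
    return Chat.objects.get(pk=pk)

class MyConsumer(AsyncConsumer):
    async def websocket_receive(self, event):
        chat = await get_chat(1)
        # ...or wrapped around a callable at the call site:
        count = await database_sync_to_async(Chat.objects.count)()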

If someone could confirm this fixes it for them, I can close this.

michaelhays commented 6 years ago

Confirmed to fix the issue for me!

andrewgodwin commented 6 years ago

Great! I'll put this out in a release in the next day or two, depending on what else I land.

LeonidIvanov commented 6 years ago

Hey, @andrewgodwin! I'm still having the same problem, and it's not only about the consumer. I was following your tutorial (multichat). I have the database_sync_to_async decorator wrapping get_room_or_error in chat.utils. The main thing is that every page refresh (not only the chat page) creates a new database connection! Django + PostgreSQL + Daphne + Heroku.

andrewgodwin commented 6 years ago

If you can reduce it to a simple set of code that reliably reproduces the problem, using the latest versions of all of the repositories, then please open a new issue with that code snippet and steps to reproduce locally (I can't debug things that only happen on Heroku).

LeonidIvanov commented 6 years ago

It's the exact same issue @michaelhays described at django/daphne#163. In my case there is no such issue locally, but there is on Heroku. And I understand that you can't debug things that only happen on Heroku... But there's no support channel where I can raise that issue. You have more influence to report it to them, as more people updating from Channels 1.0 to 2.0 will suffer from this problem...

LeonidIvanov commented 6 years ago

I just recreated everything from your clean multichat example and deployed it to Heroku at https://django-channels-2-test.herokuapp.com/ (user: demo, pass: demo), and I get the same issue. Refresh the page 20 times and you'll get

OperationalError at /
FATAL:  too many connections for role "jdknihvnjyczlm"

andrewgodwin commented 6 years ago

Can you confirm what exact versions of each channels package you're running using a pip freeze from the Heroku environment? I just tried this again locally with multichat and verified that it only ever opens a single database connection.

LeonidIvanov commented 6 years ago

aioredis==1.0.0
asgiref==2.1.6
async-timeout==2.0.0
attrs==17.4.0
autobahn==17.10.1
Automat==0.6.0
channels==2.0.2
channels-redis==2.1.0
constantly==15.1.0
daphne==2.0.4
dj-database-url==0.4.2
Django==2.0.2
hiredis==0.2.0
hyperlink==18.0.0
idna==2.6
incremental==17.5.0
msgpack==0.5.6
psycopg2==2.7.4
pytz==2018.3
six==1.11.0
Twisted==17.9.0
txaio==2.8.2
whitenoise==3.3.1
zope.interface==4.4.3

Looks like the packages on Heroku are exactly the same as in the local env... And yes, the problem only appears on Heroku; it works fine locally...

andrewgodwin commented 6 years ago

Unless I can replicate it locally I'm afraid I can't fix it, sorry - I'm not sure what's different about the Heroku environment that would cause this.

kennethreitz commented 6 years ago

Hmmmm

burcaw commented 6 years ago

Same issue, Django + PostgreSQL + Daphne + Heroku.

Originally had this issue with the consumers, but it was fixed with the prior release.

On any DB lookup within the Django app, Postgres connections are left idle and never closed. This occurs on any DB lookup, not just the very narrow area where I use Channels.

Relevant versions:

asgiref==2.2.0
channels==2.0.2
channels-redis==2.1.0
daphne==2.1.0
dj-database-url==0.5.0
Django==2.0
psycopg2==2.7.4

One additional item: I verified that these idle connections don't close after the conn_max_age of 500 seconds has elapsed.

db_from_env = dj_database_url.config(conn_max_age=500)
DATABASES['default'].update(db_from_env)

UPDATE: Perhaps this article provides a clue: https://devcenter.heroku.com/articles/forked-pg-connections

kennethreitz commented 6 years ago

what happens if you drop conn_max_age from the dj_database_url call?

seanwiseman commented 6 years ago

I was experiencing the exact same problem @burcaw described. I can confirm @kennethreitz, dropping conn_max_age from the dj_database_url.config call has resolved the issue for me. No more idle db connections are left open πŸ‘.
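
For anyone else trying this, the change is just dropping the argument (a sketch of the usual Heroku-style settings.py):

import dj_database_url

# Before: db_from_env = dj_database_url.config(conn_max_age=500)
db_from_env = dj_database_url.config()  # default conn_max_age is 0
DATABASES['default'].update(db_from_env)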

kennethreitz commented 6 years ago

✨🍰✨

bsr3000 commented 6 years ago

aioredis==1.0.0
asgi-redis==1.3.0
asgiref==2.3.0
async-timeout==2.0.1
attrs==17.4.0
autobahn==18.4.1
Automat==0.6.0
channels==2.1.1
channels-redis==2.1.1
daphne==2.1.1
Django==1.11
psycopg2==2.7.4
six==1.11.0
Twisted==17.9.0
txaio==2.9.0
txredisapi==1.4.4
ujson==1.35
uWSGI==2.0.15
zope.interface==4.4.3

Hello, I have encountered the same issue as @burcaw: Nginx + uWSGI (WSGI layer) + Daphne (ASGI layer only). I have already updated all packages to the latest versions, but even when I dropped CONN_MAX_AGE, it creates and keeps idle connections to the database without terminating them, so if a few clients connect to websockets, Postgres reaches the max connection limit.

Is there any workaround for this issue?

Thanks in advance

andrewgodwin commented 6 years ago

If you are still having the issue it means you are using code that talks to a database outside of database_sync_to_async. Make sure all your database access is wrapped in that (consumer methods on sync consumers already are).
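
As a sketch of that distinction (ChatConsumer and Chat are placeholders): plain ORM access is fine inside a sync consumer's handler methods, because Channels already wraps them, whereas any database call made from async code must go through database_sync_to_async:

from channels.db import database_sync_to_async
from channels.generic.websocket import WebsocketConsumer

class ChatConsumer(WebsocketConsumer):  # sync consumer
    def receive(self, text_data=None, bytes_data=None):
        Chat.objects.count()  # fine: Channels wraps this handler for you

async def some_async_code():
    # required whenever the caller is async:
    count = await database_sync_to_async(Chat.objects.count)()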

bsr3000 commented 6 years ago

Thanks for your reply. I have double-checked my code and didn't find any database access in methods without the database_sync_to_async decorator.

So the problem was in my server configuration and the ASGI_THREADS env var:

  1. I didn't set the environment variable in the Docker image.
  2. I had 10 instances of Daphne with max workers set to None (if I understand correctly, each Daphne had its own thread pool with the default max_workers of (os.cpu_count() or 1) * 5), and the test server it was running on had an os.cpu_count() of 16, so the max worker count for each Daphne server was 80; that's why I had so many connections to the database.
  3. Setting ASGI_THREADS and "CONN_MAX_AGE": 60 works fine now; each worker reuses its connections to the database and cleans everything up well. With my current settings of ASGI_THREADS=20 and 4 Daphne instances on different ports, the number of database connections doesn't exceed 45 while load testing them using artillery.io.

P.S. Maybe it would be useful to add some info about the ASGI_THREADS var to the documentation.
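
For reference, a quick sketch of the arithmetic described above, using the ThreadPoolExecutor default pool size that applies when ASGI_THREADS is unset:

import os

cpu_count = os.cpu_count() or 1     # 16 on the test server above
workers_per_daphne = cpu_count * 5  # default pool size: 80 threads each
daphne_instances = 10
# Worst case, every worker thread holds its own database connection:
print(workers_per_daphne * daphne_instances)  # 800 potential connections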

Also, one more question: while load testing with an external tool, I sometimes get

ERROR    Exception in callback AsyncioSelectorReactor.callLater.<locals>.run() at /venv/lib/python3.5/site-packages/twisted/internet/asyncioreactor.py:287
handle: <TimerHandle when=1713807.003341125 AsyncioSelectorReactor.callLater.<locals>.run() at /venv/lib/python3.5/site-packages/twisted/internet/asyncioreactor.py:287>
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/events.py", line 126, in _run
    self._callback(*self._args)
  File "/venv/lib/python3.5/site-packages/twisted/internet/asyncioreactor.py", line 290, in run
    f(*args, **kwargs)
  File "/venv/lib/python3.5/site-packages/twisted/internet/tcp.py", line 289, in connectionLost
    protocol.connectionLost(reason)
  File "/venv/lib/python3.5/site-packages/autobahn/twisted/websocket.py", line 128, in connectionLost
    self._connectionLost(reason)
  File "/venv/lib/python3.5/site-packages/autobahn/websocket/protocol.py", line 2467, in _connectionLost
    WebSocketProtocol._connectionLost(self, reason)
  File "/venv/lib/python3.5/site-packages/autobahn/websocket/protocol.py", line 1096, in _connectionLost
    self._onClose(self.wasClean, WebSocketProtocol.CLOSE_STATUS_CODE_ABNORMAL_CLOSE, "connection was closed uncleanly (%s)" % self.wasNotCleanReason)
  File "/venv/lib/python3.5/site-packages/autobahn/twisted/websocket.py", line 171, in _onClose
    self.onClose(wasClean, code, reason)
  File "/venv/lib/python3.5/site-packages/daphne/ws_protocol.py", line 146, in onClose
    self.application_queue.put_nowait({
AttributeError: 'WebSocketProtocol' object has no attribute 'application_queue'

Is this a Channels issue, or just a problem with the testing tool not closing the websocket connections properly?

andrewgodwin commented 6 years ago

Ah, yes. I will add notes about the number of possible threads to the documentation.

As for that error, it looks like it may be a bug in Daphne - check you are running the most recent version, and if you are, please open an issue on the Daphne issue tracker with that traceback and any more information you have.

bsr3000 commented 6 years ago

OK, no problem. I am using the latest release versions of Daphne and the other Channels-related packages. Currently it looks like this error appears when there are more connections than the Daphne server cluster can handle; the current setup is 10 Daphne instances, each set to ASGI_THREADS=8. I'll create an issue on the Daphne issue tracker with that traceback. Currently, my setup can handle 10 new connections per second using AsyncConsumers; if I push it higher, that error starts to appear.

alsib commented 6 years ago

Hi guys, catching up on the issue. Does this mean there is no way of running an API on the same server as Channels? Or would that mean that I would have to wrap all my API calls in @database_sync_to_async? I'm running on Heroku, and my consumer (AsyncWebsocketConsumer) is using the decorator.

EDIT: Looks solved. Removing CONN_MAX_AGE apparently did the trick. https://stackoverflow.com/questions/48644208/django-postgresql-heroku-operational-error-fatal-too-many-connections-for-r

iMerica commented 5 years ago

I don't have a conn_max_age set on my dj_database_url() call, and yet I still see lots of idle, un-reaped Postgres connections.

Any ideas on where to look?

Relevant versions:

channels-redis==2.3.3
channels==2.1.7
django==2.1.4
dj-database-url==0.5.0

vezzick commented 5 years ago

Dropping conn_max_age still didn't fix the issue for me. I am having this issue for REST HTTPS requests under Daphne, not the websocket connections. I don't have this issue under WSGI. Would I have to set up a new server for the websocket connections and stay with WSGI for the REST framework?

channels==2.1.7
channels-redis==2.3.3
Django==2.0.6
dj-database-url==0.5.0

vezzick commented 5 years ago

Never mind, my mistake: I was setting MAX_CONN_AGE through the Heroku settings helper django_heroku.settings(locals()), which is mentioned in https://github.com/heroku/django-heroku/issues/11.
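
One way around that, sketched under the assumption that you still want the rest of django_heroku's defaults, is to re-apply the database config after the call:

# settings.py
import dj_database_url
import django_heroku

django_heroku.settings(locals())
# django_heroku applies its own non-zero max connection age; re-apply the
# database config afterwards so connections close when their context ends.
DATABASES['default'] = dj_database_url.config()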

SHxKM commented 5 years ago

For me, the only thing that worked was setting ASGI_THREADS (as an env var) to a very low number (6). Daphne never closes these connections, by the way. I'm not sure if this is the expected behaviour, @andrewgodwin?

andrewgodwin commented 5 years ago

Daphne and Channels both don't have enough visibility into the threadpool system to close any connections in there, sadly. This is something that has to be remedied in Django itself, which I am in the process of doing.

vkolova commented 5 years ago

I removed conn_max_age, tried close_old_connections, and also set ASGI_THREADS to an even lower number (3), and the issue is still here for me, for both REST requests and websockets.

channels-redis==2.3.3
channels==2.1.7
dj-database-url==0.5.0
Django==2.1.7

LgArcida commented 5 years ago

Any update on this issue? @andrewgodwin

carltongibson commented 5 years ago

@LgArcida It's a big project. The first steps are made here: https://github.com/django/deps/blob/master/draft/0009-async.rst

joshmarlow commented 5 years ago

For those who keep finding this issue, here's what I did:

I then was able to send 1,000 messages through my channels code on Heroku and it all worked fine (previously I would have issues after 20-30 messages).

SHxKM commented 5 years ago

I have to say I'm still a little confused by this issue, even though setting ASGI_THREADS to a low number has "solved" the issue for me. I'm using a synchronous consumer.

Where am I supposed to use close_old_connections()? Here's my workflow with channels:

  1. User connects, starts a long-running celery task
  2. Celery task writes to DB, possibly thousands of times, calls Channels consumer each time with details of (updated) message to be sent to client
  3. Consumer sends a JSON string to the client each time a message is sent by Celery task, update is displayed to the user

Where does close_old_connections() go here?

Paulcappaert commented 4 years ago

I was having this problem with Channels and Heroku Postgres, but it seems to have resolved itself when I changed my runtime from 3.6 to 3.8.1. I don't need to use close_old_connections().

carltongibson commented 4 years ago

@Paulcappaert I don't suppose we have any idea what changed there, do we?

Paulcappaert commented 4 years ago

@carltongibson No idea, all I know is that I made that single change and the problem went away.

carltongibson commented 4 years ago

Hmmm. Yes... πŸ™‚ Thanks for the follow-up.

(asyncio is changing rapidly, still, but πŸ€·β€β™‚οΈ)

sagardspeed2 commented 4 years ago

> @carltongibson No idea, all I know is that I made that single change and the problem went away.

Which change did you make so the problem went away?

Paulcappaert commented 4 years ago

> Which change did you make so the problem went away?

I updated the runtime to Python 3.8.1.

sagardspeed2 commented 4 years ago

> I updated the runtime to Python 3.8.1.

OK, thanks. I also updated the runtime to Python 3.8.2 and that fixed it.

Hardeepsingh980 commented 4 years ago

> I was having this problem with Channels and Heroku Postgres, but it seems to have resolved itself when I changed my runtime from 3.6 to 3.8.1. I don't need to use close_old_connections().

You know what? I was struggling with this stupid error for the last two days. I'm glad you fixed it. Thanks a lot, I am really happy.

SHxKM commented 4 years ago

I'm happy to see @Paulcappaert's suggestion to switch runtimes helped some here; unfortunately, for some of us that's impossible, as we may be using other libraries that don't support 3.8 yet. I, for one, rely on Celery for background processing and can't really update.

andrewgodwin commented 4 years ago

For what it's worth, merely getting off of 3.6 may be enough - asyncio was a little underbaked in 3.6, and is much better in 3.7, and so most issues just need 3.7, not 3.8.

SHxKM commented 4 years ago

> For what it's worth, merely getting off of 3.6 may be enough - asyncio was a little underbaked in 3.6, and is much better in 3.7, and so most issues just need 3.7, not 3.8.

OK, just a heads up that, from my experience, that's not true. I was (and still am) using Channels 1, and when I upgraded to Python 3.7.2 to enjoy a Celery feature, I also had to upgrade to Channels 2 (all sorts of dependency clashes). That's when I encountered this issue again, and I reverted my whole setup so Channels 2 didn't kill my server. I am hoping to give this upgrade another chance soon enough, after multiple failed attempts.

I'm hosted at Heroku; YMMV.

andrewgodwin commented 4 years ago

Channels 2 versus Channels 1 is a very different conversation given that they use completely different asynchronous styles under the hood.

Generally, this issue is caused by Django not recognising that connections belong to threads or coroutines that are already dead. For Django itself this was fixed in the latest asgiref release, but Channels does not use the Local implementation in there, so it doesn't get the same fixes. I also feel like Heroku does something to the Python environment, based on a whole slew of Heroku-specific bugs that I used to get, but I have no time to verify that, unfortunately.

SHxKM commented 4 years ago

> Channels 2 versus Channels 1 is a very different conversation given that they use completely different asynchronous styles under the hood.

Let me clarify. Maybe I shouldn't have mentioned Channels 1 at all. What I was talking about is Channels 2 on Python 3.7.2; that's why I had to revert back to Channels 1, which is working perfectly fine.

phamhm commented 4 years ago

I'm using Python 3.7.6 on Heroku. My main server runs fine with gunicorn. I observed this issue when I ran Daphne on my dev server.

On my dev server, I incorporated Channels 2 + websockets and Daphne into my app and saw this issue whenever I refreshed the page. I changed my app to use AJAX instead and still saw the issue. I only have one view function that returns a JsonResponse for the AJAX call. That AJAX view function is decorated with login_required and performs a database update. So I did the following:

Using AJAX + daphne:

  1. Comment out the database update in my AJAX function => same issue with multiple AJAX calls.
  2. Comment out the login_required decorator on my AJAX function => no fatal on multiple AJAX calls.

Using websocket(Channels2) + daphne

  1. 20+ websocket calls with no database update in the WebSocketConsumer => no fatal
  2. 20+ websocket calls with a database update in the WebSocketConsumer => fatal
  3. Refresh the login_required view 20+ times => fatal

I looked at the Heroku logs and found that in all instances the fatal occurred while Django was doing session validation. I don't observe this behavior on my main server when I refresh the login_required view 20+ times; my main server uses gunicorn.

File "/app/.heroku/python/lib/python3.7/site-packages/django/contrib/auth/__init__.py", line 58, in _get_user_session_key
return get_user_model()._meta.pk.to_python(request.session[SESSION_KEY])
File "/app/.heroku/python/lib/python3.7/site-packages/django/contrib/sessions/backends/base.py", line 64, in __getitem__
return self._session[key]

I believe that Django session management is causing this issue when I use Daphne. When either a Django view or a WebSocket consumer requires validation, a connection is opened to the database but never closed. This behavior is not apparent on the local machine because we don't hit a database connection limit there.

carltongibson commented 4 years ago

Hi. Are you able to reproduce on a laptop, or only on Heroku? (That still seems to be the issue...)