Closed: agannon closed this issue 3 years ago
Hi @agannon: short of a minimal reproduction, there's very little I can say here. It could be any one of many issues somewhere in the stack. If you can narrow it down to something in Channels, then I'm happy to take a look, but otherwise it's just too open-ended. Sorry.
Background

I am currently migrating a legacy codebase from Django 1.11 and Channels 1 to Django 3.2 and Channels 3. Due to dependency issues, I am doing it piecemeal, in this fashion:
After step one (Channels 2.2), I started noticing an error popping up, mostly on our daphne dynos and some Celery dynos on Heroku. (I've replaced the actual host address with $CORRECT_HOST_ADDRESS for anonymity.) At random intervals, our servers would query the PostgreSQL database and hit a connection timeout after roughly one minute, even though the logs showed the correct address. These timeouts caused dozens of requests being served by that web dyno to fail, since Heroku waits only 30 seconds before returning a 5XX error.
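Not part of the original report, but a stdlib-only probe like the sketch below can help distinguish a DNS failure from a TCP connect timeout against the database host. The host and port arguments are placeholders for the real $CORRECT_HOST_ADDRESS and database port:

```python
import socket


def probe_postgres(host: str, port: int = 5432, timeout: float = 5.0) -> str:
    """Classify why a TCP connection to the database host fails.

    Returns one of: "ok", "dns-failure", "timeout", "refused", "error".
    """
    try:
        # Resolve the hostname first, so DNS problems are reported separately
        # from connection problems.
        addrinfo = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    family, socktype, proto, _, sockaddr = addrinfo[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
        return "ok"
    except socket.timeout:
        return "timeout"      # host resolved, but no SYN-ACK within `timeout`
    except ConnectionRefusedError:
        return "refused"      # host reachable, but nothing listening on the port
    except OSError:
        return "error"        # any other network-level failure
    finally:
        sock.close()
```

Running this from the affected dyno at the moment a timeout occurs would show whether the failure is name resolution or the TCP connection itself.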
I have since upgraded to Django 2.2 in the hope that this was a regression specific to Django 1.11 with Channels 2.2, but the problem is still there.
Here is my current environment from pip list:
Server definition:

```shell
ddtrace-run daphne appname.asgi:application --port $PORT --bind 0.0.0.0 --verbosity 1
```
I was only able to replicate the timeout locally by supplying an incorrect IP address where no PostgreSQL server was running. In that case, though, the logs printed the incorrect host I had passed in. In my production environment, the logs print the correct host and port, yet the connection still fails.
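One mitigation worth trying while diagnosing (my suggestion, not something from the original report): psycopg2 accepts the libpq connect_timeout parameter, which Django passes through via the OPTIONS dict, so a hung connection attempt fails after a few seconds instead of ~1 minute, inside Heroku's 30-second router window. The values below are illustrative placeholders, not the reporter's real settings:

```python
# settings.py (sketch) -- cap how long psycopg2 waits for the PostgreSQL
# server before raising OperationalError. All names/values here are
# placeholders for illustration only.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "$CORRECT_HOST_ADDRESS",  # placeholder, as in the report
        "PORT": 5432,
        "NAME": "appdb",
        "USER": "appuser",
        "PASSWORD": "...",
        # Close connections after each request instead of persisting them,
        # so a long-lived daphne/Celery process cannot reuse a stale one.
        "CONN_MAX_AGE": 0,
        "OPTIONS": {
            # libpq connection parameter, in seconds; passed through by
            # Django to psycopg2.connect().
            "connect_timeout": 5,
        },
    }
}
```

With this in place, a dyno that cannot reach the database would surface an OperationalError quickly rather than stalling requests until Heroku's router returns a 5XX.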
I am hoping that the upgrade to Django 3 and Channels 3 will resolve this issue, but in case it doesn't, I would appreciate help from the Django team in diagnosing it.
I know that our server is set up to accept connections at this address; otherwise, all of our requests would fail, not just random ones.
I know that the ddtrace library is not the culprit (even though it appears in the stack trace), because I temporarily removed Datadog from our stack to see whether that would fix it, and the behavior remained. We also see a lot of these log statements for our websockets when one of these events occurs: