django / channels

Developer-friendly asynchrony for Django
https://channels.readthedocs.io
BSD 3-Clause "New" or "Revised" License
6.11k stars 800 forks source link

Nondeterministically unable to connect to remote postgresql instance after upgrade from Channels 1 to Channels 2.2.0 #1752

Closed agannon closed 3 years ago

agannon commented 3 years ago

Background I am currently migrating a legacy codebase from Django 1.11 & Channels 1 to Django 3.2 and Channels 3. Due to dependency issues I am doing it piecemeal in this fashion

  1. Upgrade Channels 1 to Channels 2.2
  2. Upgrade Django 1.11 to Django 2.2
  3. Upgrade Django and Channels to most recent stable version

After step one (Channels 2.2) I started noticing an error popping up mostly on our daphne and some Celery dynos on heroku:

Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/app/.heroku/python/lib/python3.6/site-packages/ddtrace/contrib/psycopg/patch.py", line 122, in patched_connect
    conn = connect_func(*args, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection timed out
    Is the server running on host "$CORRECT_HOST_ADDRESS" ($CORRECT_HOST_IP) and accepting
    TCP/IP connections on port 5433?
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/companies/app_views/webchat.py", line 49, in web_chat_user
    user_profile = UserProfile.objects.filter(web_cookie_id=web_cookie).first()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/models/query.py", line 653, in first
    for obj in (self if self.ordered else self.order_by('pk'))[:1]:
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/models/query.py", line 274, in __iter__
    self._fetch_all()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/models/query.py", line 1242, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/models/query.py", line 55, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
    cursor = self.connection.cursor()
  File "/app/.heroku/python/lib/python3.6/site-packages/ddtrace/contrib/trace_utils.py", line 161, in wrapper
    return func(mod, pin, wrapped, instance, args, kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/ddtrace/contrib/django/patch.py", line 96, in cursor
    cursor = func(*args, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 256, in cursor
    return self._cursor()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 233, in _cursor
    self.ensure_connection()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/app/.heroku/python/lib/python3.6/site-packages/ddtrace/contrib/psycopg/patch.py", line 122, in patched_connect
    conn = connect_func(*args, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not connect to server: Connection timed out
    Is the server running on host "$CORRECT_HOST_ADDRESS" ($CORRECT_HOST_IP) and accepting
    TCP/IP connections on port 5433?

I've replaced the actual host address with $CORRECT_HOST_ADDRESS for anonymity. But at random intervals our servers would query the postgresql database and get a connection timeout after not finding the database after ~1 minute. Even though it had the correct address. And these timeouts would cause dozens of requests being served by that web dyno to fail on heroku since they can only wait 30s before returning a 5XX error.

I have since upgraded to Django 2.2 in the hope that it was some regression with Django 1.11 and Channels 2.2 but the problem is still there.

Here is my current environment from pip list

aioredis==1.3.1           
amqp==5.0.6               
appdirs==1.4.3            
asgiref==3.3.4            
async-timeout==3.0.1      
attrs==21.2.0             
autobahn==19.3.3          
automat==0.7.0            
babel==2.6.0              
bandwidth-sdk==3.0.3      
beautifulsoup4==4.7.1     
billiard==3.6.4.0         
bleach==3.3.0             
blis==0.4.1               
boltons==19.1.0           
boto3==1.18.33            
botocore==1.21.33         
cached-property==1.5.1    
cachetools==3.1.0         
catalogue==1.0.0          
celery[redis]==5.1.2      
certifi==2019.3.9         
cffi==1.12.3              
channels-redis==3.2.0     
channels==2.2.0           
chardet==3.0.4            
charset-normalizer==2.0.4 
click-didyoumean==0.0.3   
click-plugins==1.1.1      
click-repl==0.2.0         
click==7.0                 
constantly==15.1.0        
croniter==0.3.34          
cryptography==3.4.7       
cymem==2.0.2              
daphne==2.5.0             
dataclasses==0.7          
dateparser==0.7.1         
datetimerange==0.5.5      
ddtrace==0.50.2           
defusedxml==0.6.0         
dill==0.2.9               
dj-database-url==0.5.0    
django-appconf==1.0.3     
django-axes==4.5.4        
django-bower==5.2.0       
django-compressor==2.2    
django-cors-headers==2.5.3
django-debug-toolbar==1.11
django-extensions==2.1.6  
django-htmlmin==0.11.0    
django-ipware==2.1.0      
django-json-response==1.1.5  
django-manifest-loader==1.0.0 
django-model-utils==3.1.2  
django-phonenumber-field==2.3.1 
django-ratelimit==2.0.0   
django-redis==4.10.0      
django-rest-auth==0.9.5   
django-rq==2.4.1          
django-static-precompiler==1.8.2  
django-test-without-migrations==0.6 
django==2.2.24             
djangoql==0.14.0          
djangorestframework==3.9.2
dnspython==1.16.0         
docker-py==1.10.6         
docker-pycreds==0.4.0     
dockerpty==0.4.1          
emoji==0.5.2              
eventlet==0.31.1          
freezegun==0.3.11         
future==0.16.0            
fuzzywuzzy==0.17.0        
google-api-core[grpc]==1.9.0  
google-api-python-client==1.7.8  
google-auth-httplib2==0.0.3  
google-auth==1.6.3        
google-cloud-core==0.29.1  
google-cloud-language==1.2.0  
google-cloud-speech==1.0.0  
google-gax==0.16.0        
googleapis-common-protos==1.5.9
googlemaps==3.0.2         
greenlet==0.4.15          
grpcio==1.20.0            
hiredis==2.0.0            
html5lib==1.0.1           
httplib2==0.12.3          
hyperlink==19.0.0         
icalendar==4.0.3          
idna==2.7                 
importlib-metadata==1.6.0 
incremental==17.5.0       
inflect==2.1.0            
ipaddress==1.0.22         
isodate==0.6.0            
jmespath==0.9.4           
kombu==5.1.0              
lxml==4.3.3               
mbstrdecoder[all]==0.7.0  
messengerbot==0.1.4       
mixpanel==4.3.2           
mock==3.0.5               
money==1.3.0              
msgpack==1.0.2            
murmurhash==1.0.2         
mypy-extensions==0.4.3    
mypy==0.761               
natsort==7.1.1            
nltk==3.4.1               
numpy==1.16.3             
oauth2==1.9.0.post1       
oauth2client==4.1.3       
packaging==21.0           
pandas==0.24.2            
parsedatetime==2.4        
pep562==1.1               
phonenumbers==8.12.30     
pillow==6.0.0             
plac==0.9.6               
ply==3.8                  
preshed==3.0.2            
prompt-toolkit==3.0.19    
protobuf==3.7.1           
psutil==5.7.0             
psycopg2==2.8.4           
py-mini-racer==0.1.18     
pyasn1-modules==0.2.5     
pyasn1==0.4.5             
pycparser==2.19           
pycryptodome==3.8.1       
pyhamcrest==1.9.0         
pyjwt==1.7.1              
pymessenger==0.0.7.0      
pynacl==1.3.0             
pyopenssl==20.0.1         
pyparsing==2.4.7          
pystache==0.5.4           
python-dateutil==2.8.0     
python-dotenv==0.10.1     
python-google-places==1.4.1
python-levenshtein==0.12.0
python3-saml==1.7.0       
pytz==2019.1              
raven==6.10.0             
rcssmin==1.0.6            
redis==3.5.3               
regex==2017.11.9          
requests-aws-sign==0.1.5  
requests-toolbelt==0.9.1  
requests==2.26.0          
rjsmin==1.0.12            
rq-scheduler==0.11.0      
rq==1.9.0                 
rsa==3.4.2                
s3transfer==0.5.0         
scikit-learn==0.20.3      
scipy==1.2.1              
sendgrid==1.5.20          
service-identity==18.1.0  
six==1.16.0               
slackclient==1.3.1        
smtpapi==0.2.0            
soupsieve==1.9.1          
spacy==2.2.4              
speechrecognition==3.8.1  
sqlparse==0.3.0           
sqreen==1.14.5            
squareconnect==2.20190410.0
srsly==1.0.2              
stripe==2.27.0            
suds-jurko==0.6           
tenacity==7.0.0           
thinc==7.4.0              
tinys3==0.1.11            
tqdm==4.45.0              
trello==0.9.4             
twilio==6.53.0            
twisted[tls]==19.2.0      
txaio==18.8.1             
typed-ast==1.4.1          
typepy[datetime]==0.4.0   
typing-extensions==3.7.4.1
tzlocal==1.5.1            
untangle==1.1.1           
uritemplate==3.0.0        
urllib3==1.25.2           
vine==5.0.0               
vobject==0.9.6.1          
wasabi==0.6.0             
wcwidth==0.2.5            
webencodings==0.5.1       
websocket-client==0.44.0  
whitenoise==4.1.2         
xmlsec==1.3.10            
yelp==1.0.2               
zeep==3.3.1               
zipp==3.1.0               
zope.interface==4.6.0     

Server definition: ddtrace-run daphne appname.asgi:application --port $PORT --bind 0.0.0.0 --verbosity 1

I was only able to replicate the timeout locally by submitting an incorrect IP address where the postgresql server was not running. But the logs would print the incorrect host that I passed to it. In my production environment, it is printing the correct host and port but still unable to connect.

I am hoping that the upgrade to Django 3 & Channels 3 will resolve this issue but in case it doesn't I would appreciate it if the django team could help diagnose this.

I know that our server is setup to accept connections at this address otherwise all our requests would fail and not just random ones.

I know that the ddtrace library is not the culprit (it is in the stacktrace) as I removed datadog from our stack temporarily to see if that would fix it and the behavior remained

We also see a lot of these log statements for our websockets when one of these events occurs

 at=error code=H13 desc="Connection closed without response" method=GET path="/chat/e1YwnyQBX4" host=host.com request_id=uuid fwd="xxx" dyno=web.14 connect=0ms service=4321ms status=503 bytes=0 protocol=https
carltongibson commented 3 years ago

Hi @agannon — short of a minimal reproduce, there's very little I can say here. It could be any one of many issues somewhere in the stack. If you can narrow it down to something in Channels then I'm happy to take a look, but otherwise it's just too open ended. Sorry.