freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

ConnectionError: Error while reading from socket: (104, 'Connection reset by peer') #1448

Open sentry-io[bot] opened 3 years ago

sentry-io[bot] commented 3 years ago

We've seen an explosion in Redis connections failing with the python3 conversion.

Sentry Issue: COURTLISTENER-H4

ConnectionResetError: [Errno 104] Connection reset by peer
  File "redis/connection.py", line 396, in read_response
    bufflen = recv_into(self._sock, self._buffer)
  File "redis/_compat.py", line 61, in recv_into
    return sock.recv_into(*args, **kwargs)

ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')
(16 additional frame(s) were not displayed)
...
  File "redis/client.py", line 1264, in get
    return self.execute_command('GET', name)
  File "redis/client.py", line 775, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "redis/client.py", line 789, in parse_response
    response = connection.read_response()
  File "redis/connection.py", line 637, in read_response
    response = self._parser.read_response()
  File "redis/connection.py", line 408, in read_response
    raise ConnectionError("Error while reading from socket: %s" %
flooie commented 3 years ago

Interestingly, the error only appears to affect dockets.

mlissner commented 3 years ago

Hm, troubling. I don't know why it'd only affect dockets, nor why anything would be different here. The only relevant networking change is that our Python code is now dockerized, so it has to jump through some network hoops to reach Redis.

One thing that has changed with Redis is that it suddenly has about 17GB of stuff in it, instead of the paltry amounts it had before. That 17GB is related to the 1.1M failed IA uploads, since most failed Celery tasks get stored in Redis for some number of hours. 17GB still isn't much in the scheme of things, but maybe it's playing a role.
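For context on why failed tasks pile up there: Celery keeps task results, including failures, in its result backend until `result_expires` elapses, so a burst of failed uploads can grow the Redis dataset. A minimal sketch of that knob, assuming a Redis broker/backend with illustrative URLs (not CourtListener's actual config):

```python
from celery import Celery

# Hypothetical app and URLs for illustration only; not CourtListener's settings.
app = Celery("cl", broker="redis://redis:6379/0", backend="redis://redis:6379/1")

# How long task results (including failed ones) linger in the Redis backend.
# Celery's default is one day; shortening it limits how much failed-task
# state can accumulate.
app.conf.result_expires = 3600  # seconds
```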

This one seems hard to diagnose and fix. One strategy could be retries, but those have their own issues.
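For illustration, the retry idea could look roughly like the sketch below. The wrapper, key names, and backoff values are assumptions, not code from CourtListener; it simply catches redis-py's `ConnectionError` (the exception in the Sentry report) and retries the GET.

```python
import time

import redis
from redis.exceptions import ConnectionError as RedisConnectionError

# Illustrative client; host and port are assumptions.
r = redis.Redis(host="redis", port=6379)

def get_with_retry(key, attempts=3, delay=0.5):
    """GET a key, retrying when the server resets the connection."""
    for attempt in range(1, attempts + 1):
        try:
            return r.get(key)
        except RedisConnectionError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay * attempt)  # simple linear backoff before retrying
```

The trade-off is presumably the "issues" mentioned above: retries can mask a genuinely unhealthy Redis and add latency on every failure.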

mlissner commented 3 years ago

Via the Sentry timeline it looks like this was resolved. I think the resolution was to clear out redis in issue #1460.

mlissner commented 3 years ago

This issue still crops up occasionally, though it's infrequent. The solution seems to be to get health_check_interval landed in django-redis-cache. The issue for that is closed at the moment, but it's here: https://github.com/sebleier/django-redis-cache/issues/184
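For reference, redis-py itself has supported `health_check_interval` since 3.3.0: if a connection has been idle longer than the interval, the client health-checks it with a PING before the next command and re-establishes it if the socket is dead, rather than surfacing "Connection reset by peer". The sketch below is plain redis-py usage with assumed host/port; how this gets exposed through django-redis-cache's settings is exactly what the linked issue is about, so that side is left out.

```python
import redis

# Plain redis-py usage; host and port are assumptions for illustration.
r = redis.Redis(
    host="redis",
    port=6379,
    health_check_interval=30,  # PING connections idle for more than 30 seconds
)

# An idle connection is health-checked before this command runs, so a
# server-side reset is detected and the connection reconnected first.
r.set("docket:123", "cached value")
print(r.get("docket:123"))
```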