aio-libs-abandoned / aioredis-py

asyncio (PEP 3156) Redis support
https://aioredis.readthedocs.io/
MIT License

ConnectionResetError: [Errno 104] Connection reset by peer #778

Open zzlpeter opened 4 years ago

zzlpeter commented 4 years ago

Hi, everybody! I use Tornado + aioredis, and recently I ran into this issue; the traceback is below. My environment: aioredis==1.2.0, tornado==5.1.1. I create the pool with aioredis.create_redis_pool(**args). Can anybody help? Thanks a lot.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
  File "/views/notice.py", line 341, in get
    items, page_size, total_page, total_size = await Notice.cache_or_api_list(notice_id_list, page_count, page_size)
  File "models/notice.py", line 136, in cache_or_api_list
    items = await cls.query_list(page_list)
  File "models/notice.py", line 92, in query_list
    items = await asyncio.gather(*[Notice.cache_or_api(notice_id) for notice_id in notice_id_list])
  File "models/notice.py", line 37, in cache_or_api
    info = await redis.execute('get', redis_key)
  File "models/notice.py", line 37, in cache_or_api
    info = await redis.execute('get', redis_key)
  File "models/notice.py", line 37, in cache_or_api
    info = await redis.execute('get', redis_key)
  [Previous line repeated 11 more times]
  File "/usr/local/lib/python3.6/site-packages/aioredis/connection.py", line 183, in _read_data
    obj = await self._reader.readobj()
  File "/usr/local/lib/python3.6/site-packages/aioredis/stream.py", line 94, in readobj
    await self._wait_for_data('readobj')
  File "/usr/local/lib/python3.6/asyncio/streams.py", line 464, in _wait_for_data
    yield from self._waiter
  File "/usr/local/lib/python3.6/asyncio/selector_events.py", line 723, in _read_ready
    data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

CL545740896 commented 4 years ago

I had the same ConnectionResetError.

seandstewart commented 3 years ago

Can you please check out the latest master and test with that? Note that as of #891 the client has the same API as redis-py.
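
For reference, a rough sketch of the API difference being referred to here, assuming the 1.x pool factory from the original report and the redis-py-style client on master; the key name is illustrative, not taken from this thread:

import aioredis

# aioredis 1.x style (as in the original report): explicit pool factory,
# commands issued through execute().
async def v1_style():
    redis = await aioredis.create_redis_pool("redis://localhost")
    value = await redis.execute("get", "some-key")
    redis.close()
    await redis.wait_closed()
    return value

# aioredis 2.x / current master: the client mirrors redis-py's API.
async def v2_style():
    redis = aioredis.from_url("redis://localhost")
    return await redis.get("some-key")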

djstein commented 3 years ago

Getting a few thousand of these a day when using Django Channels. Tested with Python 3.8.1 and Python 3.9.2, using aioredis 1.3.1 installed as a dependency of channels.

Andrew-Chen-Wang commented 3 years ago

@djstein Are you getting this in production? If you're experiencing this in development, please refer to #930 for migration needs, as the v1 major version will not get any fixes.

rushilsrivastava commented 3 years ago

This error still exists in aioredis==2.0.0 in production.

RonaldinhoL commented 3 years ago

I had the same ConnectionResetError.

RonaldinhoL commented 3 years ago

From my tests, when this occurs, aioredis / redis-py / asyncio-redis can't reconnect, but aredis can. What is the difference between them?

eneloop2 commented 3 years ago

Hi all, the problem lies with the ConnectionError type. aioredis implements its own ConnectionError(builtins.ConnectionError, RedisError), which causes problems because ConnectionResetError is no longer a subclass of that ConnectionError. If one looks at line 37 of client.py, it shadows the builtin ConnectionError, so the exception can no longer be caught on line 1067...
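
A minimal, self-contained sketch of the shadowing problem described above; the class definitions are illustrative stand-ins for aioredis.exceptions, not the actual source:

import builtins

class RedisError(Exception):
    pass

# Stand-in for aioredis's own ConnectionError, which subclasses both the
# builtin ConnectionError and RedisError.
class ConnectionError(builtins.ConnectionError, RedisError):
    pass

try:
    # What the socket layer raises on a reset: a subclass of the *builtin*
    # ConnectionError only, not of the narrower class defined above.
    raise ConnectionResetError(104, "Connection reset by peer")
except ConnectionError:
    print("caught")  # never reached once the name is shadowed
except builtins.ConnectionError:
    print("escapes the aioredis-style handler")  # this branch runs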

shaakaud commented 3 years ago

Any workaround found for this issue?

Enchufa2 commented 2 years ago

Same issue here in production. There's a firewall resetting connections after some time, and aioredis won't recover from this, which is a serious problem. Is there any known workaround?

Enchufa2 commented 2 years ago

Please, consider this issue in #1225.

rushilsrivastava commented 2 years ago

This issue can be closed, it should've been solved with #1129.

Enchufa2 commented 2 years ago

No, sorry, it doesn't solve this issue. Quick test:

  1. Run a Redis instance locally:
$ docker run --rm -d -p 6379:6379 --name redis redis
  2. Run a simple test.py program that continually reads something from Redis. For example (ugly, but easy to translate to redis):
from aioredis import Redis
import asyncio, time

redis = Redis()
loop = asyncio.get_event_loop()
loop.run_until_complete(redis.set("something", "something"))

while True:
  print(loop.run_until_complete(redis.get("something")))
  time.sleep(5)
  3. Kill the connection (technically this is a connection abort, not a connection reset, but it has the same effect and unveils the same underlying issue):
$ sudo ss -K dst [::1] dport = redis
Traceback (most recent call last):
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 815, in send_packed_command
    await asyncio.wait_for(
  File "/usr/lib64/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 797, in _send_packed_command
    await self._writer.drain()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 387, in drain
    await self._protocol._drain_helper()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 190, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "***/test.py", line 9, in <module>
    print(loop.run_until_complete(redis.get("something")))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/client.py", line 1063, in execute_command
    await conn.send_command(*args)
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 840, in send_command
    await self.send_packed_command(
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 829, in send_packed_command
    raise ConnectionError(
aioredis.exceptions.ConnectionError: Error UNKNOWN while writing to socket. Connection lost.
Traceback (most recent call last):
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 762, in disconnect
    await self._writer.wait_closed()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 359, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 815, in send_packed_command
    await asyncio.wait_for(
  File "/usr/lib64/python3.9/asyncio/tasks.py", line 442, in wait_for
    return await fut
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 797, in _send_packed_command
    await self._writer.drain()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 375, in drain
    raise exc
  File "/usr/lib64/python3.9/asyncio/selector_events.py", line 856, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionAbortedError: [Errno 103] Software caused connection abort

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "***/test.py", line 9, in <module>
    print(loop.run_until_complete(redis.get("something")))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/client.py", line 1063, in execute_command
    await conn.send_command(*args)
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 840, in send_command
    await self.send_packed_command(
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 829, in send_packed_command
    raise ConnectionError(
aioredis.exceptions.ConnectionError: Error 103 while writing to socket. Software caused connection abort.

So we switched error messages, but the issue persists.

Enchufa2 commented 2 years ago

A workaround would be to call redis.connection_pool.disconnect() before performing any operation that follows a long pause during which a reset may have happened.
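
A minimal sketch of that workaround, assuming the aioredis 2.x client used elsewhere in this thread; the key name is illustrative:

import asyncio
from aioredis import Redis

async def main():
    redis = Redis(host="127.0.0.1")
    # Drop any pooled connections that may have been reset/aborted during a
    # long idle period, so the next command opens a fresh connection.
    await redis.connection_pool.disconnect()
    print(await redis.get("something"))

asyncio.run(main())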

Andrew-Chen-Wang commented 2 years ago

@Enchufa2 thanks for running the test. What do you mean by "automagically"? Is there code that they implement that we somehow missed, or code that we did port that is not functioning properly? Does the internal connection not handle this properly?

Enchufa2 commented 2 years ago

I mean that redis somehow figures out that the connection is broken, disposes of it and opens a new one instead of failing. I'm not sure how redis does this, and therefore what's missing here, because unfortunately I'm not familiar with either codebase. But redis's behaviour is what I would expect from a connection pool. Otherwise, one needs to think about whether there are connections and whether they are alive, which is contrary to the very abstraction of a connection pool, right?

Andrew-Chen-Wang commented 2 years ago

@Enchufa2 can you change this:

https://github.com/aio-libs/aioredis-py/blob/dbdd0add63f986f2ed2d56c9736303d133add23c/aioredis/connection.py#L850

to if not self.is_connected:

redis-py is checking using self._sock, but we don't use self._sock. This could be the underlying reason, though I'm not sure how we didn't catch this early on, or whether a PR changed this somehow.

Enchufa2 commented 2 years ago

Nope, this doesn't help, same error. By the way, I noticed that I switched the errors coming from the current release and the current master branch in my previous comment. Apologies, the comment is amended now.

Enchufa2 commented 2 years ago

This is interesting. It turns out I had redis v3.5.3 installed, and this is the version that recovers from connection aborts and resets. I just updated to v4.0.2 and it shows this same issue.

Enchufa2 commented 2 years ago

Reported in https://github.com/redis/redis-py/issues/1772

Andrew-Chen-Wang commented 2 years ago

@Enchufa2 I'm unable to reproduce this error on both the redis main/master branch and aioredis==2.0.0: https://github.com/Andrew-Chen-Wang/aioredis-issue-778. Unlike ss, I'm using CLIENT KILL and CLIENT KILL ADDR, with no success. When doing CLIENT LIST, in both cases a new connection is established, shown via an incremented ID. I was using Gitpod since I don't have a tool similar to ss on Mac for killing sockets. There may be a chance CLIENT KILL gave aioredis a warning beforehand, but if a Connection reset by peer is occurring, then I can't imagine functionality similar to CLIENT KILL not being performed.

Enchufa2 commented 2 years ago

If I use CLIENT KILL, I see this with the current master branch:

Traceback (most recent call last):
  File "***/test.py", line 9, in <module>
    print(loop.run_until_complete(redis.get("something")))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/client.py", line 1064, in execute_command
    return await self.parse_response(conn, command_name, **options)
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/client.py", line 1080, in parse_response
    response = await connection.read_response()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 854, in read_response
    response = await self._parser.read_response()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 367, in read_response
    raw = await self._buffer.readline()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 301, in readline
    await self._read_from_socket()
  File "/home/***/.local/lib/python3.9/site-packages/aioredis/connection.py", line 250, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
aioredis.exceptions.ConnectionError: Connection closed by server.

cjdsellers commented 2 years ago

We were also seeing this in production with aioredis==2.0.0. Our solution was to remove the connection pool and rely on single connections (although this isn't ideal).

@Andrew-Chen-Wang I see you were unable to reproduce, is there any current plan to look into this further? Or would you like someone to put together a PR with a potential fix?

Andrew-Chen-Wang commented 2 years ago

@cjdsellers I still can't reproduce this issue, so a PR is much appreciated as we're down a maintainer now and I'm stuck with work.

cjdsellers commented 2 years ago

I totally understand regarding the bandwidth. Maybe @Enchufa2 could chime in here with a solution strategy, as he seems to be very familiar with the issue?

Andrew-Chen-Wang commented 2 years ago

Tracking https://github.com/redis/redis-py/issues/1789

Enchufa2 commented 2 years ago

@cjdsellers My bet is that a possible solution would be to catch OSError in https://github.com/aio-libs/aioredis-py/blob/224f843bd4b33d657770bded6f86ce33b881257c/aioredis/connection.py#L1423 but I haven't had the time to test this yet. I'm also stuck with work :(, and the workaround in https://github.com/aio-libs/aioredis-py/issues/778#issuecomment-984654896, although not ideal, works well.

cjdsellers commented 2 years ago

OK, I could put together a PR for you guys to look at. I'll see what I can do over this festive period.

@Enchufa2 In the meantime, I did try the workaround you suggested in your comment; however, maybe my implementation wasn't correct, because the issue remained. If you're successfully using this workaround, then some sort of code snippet would be much appreciated!

Enchufa2 commented 2 years ago

OK, I could put together a PR for you guys to look at. I'll see what I can do over this festive period.

There are situations in which aioredis fails and redis does not (for instance, after a CLIENT KILL, see https://github.com/redis/redis-py/issues/1772). So keep in mind that aioredis may have additional issues not present in redis. In this regard, I would try a fix in redis in the first place, and then port it here.

@Enchufa2 In the meantime, I did try the workaround you suggested in your comment; however, maybe my implementation wasn't correct, because the issue remained. If you're successfully using this workaround, then some sort of code snippet would be much appreciated!

Currently I'm using aioredis in a microservice that makes a snapshot of certain information every n minutes. The code is not public, but this is essentially what it does:

async def set_snapshot():
    # ...
    try:
        s = snapshot.get()  # application-specific snapshot object
        await redis.set("snapshot-identifier", s)  # redis is an aioredis.Redis instance
    finally:
        # workaround for https://github.com/aio-libs/aioredis-py/issues/778
        await redis.connection_pool.disconnect()

As I said, this fragment is currently called every n minutes, so destroying the pool on every call doesn't really have much impact in my case. My original issue started when the infra people set up a firewall to automatically abort long-lasting connections.

Enchufa2 commented 2 years ago

So this simple fix https://github.com/redis/redis-py/pull/1832 solves the issue for redis. That patch is required here but not sufficient, because the situation is more complex: aioredis has not caught up with how redis manages command execution. redis retries unconditionally, e.g., here

https://github.com/redis/redis-py/blob/04b8d34e212723974b9b1f484fe7cd9e93f0e315/redis/client.py#L1171-L1175

However, aioredis fails unless the error is a TimeoutError here

https://github.com/aio-libs/aioredis-py/blob/a28e0e9a21f6b645952d888e65f29b433218fba5/aioredis/client.py#L1088-L1089

Porting https://github.com/redis/redis-py/pull/1832 and commenting out those two lines above solves this issue. But that's not a solution, obviously. Therefore, aioredis would need to port the Retry class from redis first, or a similar mechanism, and then modify all the client functions to use call_with_retry.
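
A rough, hedged sketch of the retry mechanism being proposed, modelled on the do/fail shape visible in the retry.py frames of the traceback further down; this illustrates the pattern, it is not the actual redis-py or aioredis code:

class Retry:
    """Retry a command a fixed number of times on connection-level errors."""

    def __init__(self, retries=3, supported_errors=(ConnectionError, TimeoutError, OSError)):
        self._retries = retries
        self._supported_errors = supported_errors

    async def call_with_retry(self, do, fail):
        failures = 0
        while True:
            try:
                # e.g. send the command and parse the response
                return await do()
            except self._supported_errors as error:
                failures += 1
                if failures > self._retries:
                    raise
                # e.g. disconnect the broken connection so the next attempt
                # reconnects, instead of failing on anything but TimeoutError
                await fail(error)

In the real clients the fail callback (e.g. the _disconnect_raise frame in the traceback below) also decides whether to re-raise immediately, which is presumably where the TimeoutError-only behaviour mentioned above lives.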

Andrew-Chen-Wang commented 2 years ago

Thank you for identifying the solution Enchufa2.

Yes, we're... fairly far behind, and I apologize for that. Hoping to reboot #1156 sometime soon (maybe once my new semester begins), but these ports do take quite a while unfortunately. I'm hoping to get these ports in through a more incremental approach since #1156 stalled.

Andrew-Chen-Wang commented 2 years ago

@Enchufa2 please test this on #1156. It should be up to date. I'm hoping to push out aioredis 2.1.0 as soon as #1156 is merged. Thanks for the patience everyone!

Enchufa2 commented 2 years ago

#1156 recovers from CLIENT KILL (which is good; redis did that) and still raises an error for connection aborts and resets (as expected, even if redis/redis-py#1832 is implemented).

Andrew-Chen-Wang commented 2 years ago

#1156 only goes up to 4.0.1. I've started a new PR to get up to date with 4.1.0, if that helps. Is the stack trace the same as before? Can you provide it? Why does 1832 still fail if we're catching OSError?

Enchufa2 commented 2 years ago

Same stack trace. To fix this, it is necessary to implement both retries (not only timeout retries) and 1832.

Andrew-Chen-Wang commented 2 years ago

Gotcha. To be clear, it's Connection's retry_on_error that resolves this, right? Though I'm not understanding how it resolves it using the default Connection, as I think the default only catches TimeoutError. Are you adding additional errors? Thanks for the help along the way @Enchufa2!

Enchufa2 commented 2 years ago

Gotcha. To be clear, it's Connection's retry_on_error that resolves this, right?

No, it's call_with_retry, from the Retry class. The issue is two-fold:

  1. When aioredis tries to use a connection, it retries only if there's a timeout; otherwise, it simply fails. redis implements call_with_retry, which always retries, whatever happens, and triggers a disconnect if everything goes wrong. The disconnect method then tests the socket while ignoring OSError, so it doesn't fail on resets and aborts, and reconnection can happen.
  2. When aioredis or redis asks for a new connection from the pool, and that connection is down due to a reset or an abort, OSError is raised. This is what 1832 fixes (see the sketch after this list).
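
To illustrate the second point, a hedged sketch of the kind of change redis/redis-py#1832 makes to the pool's connection checkout; the method and helper names are approximations, not the actual source:

async def get_connection(pool, command_name):
    connection = pool.make_or_reuse_connection()  # hypothetical helper
    try:
        await connection.connect()
        # Peeking at a socket that was reset/aborted raises OSError; before
        # the fix, that error escaped to the caller instead of being treated
        # as "this pooled connection is dead".
        if await connection.can_read():
            raise ConnectionError("Connection has data")
    except (ConnectionError, OSError):
        await connection.disconnect()
        await connection.connect()  # replace the dead socket
        if await connection.can_read():
            raise ConnectionError("Connection not ready")
    return connection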

Andrew-Chen-Wang commented 2 years ago

@Enchufa2 OK, I didn't think it was that, as we implemented all of call_with_retry, but, as it turns out, in the port we forgot to await the fail/disconnect call in call_with_retry... It's now fixed in #1156. Could you please test it one more time for me? Thank you so much! (I've also implemented redis/redis-py#1832 in #1156.)

Enchufa2 commented 2 years ago

We're still in the same spot with those changes:

Traceback (most recent call last):
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/connection.py", line 802, in disconnect
    await self._writer.wait_closed()  # type: ignore[union-attr]
  File "/usr/lib64/python3.10/asyncio/streams.py", line 344, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/connection.py", line 853, in send_packed_command
    await asyncio.wait_for(
  File "/usr/lib64/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/connection.py", line 835, in _send_packed_command
    await self._writer.drain()
  File "/usr/lib64/python3.10/asyncio/streams.py", line 360, in drain
    raise exc
  File "/usr/lib64/python3.10/asyncio/selector_events.py", line 856, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionAbortedError: [Errno 103] Software caused connection abort

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/***/redis-issue/test-aioredis.py", line 9, in <module>
    print(loop.run_until_complete(redis.get("something")))
  File "/usr/lib64/python3.10/asyncio/base_events.py", line 641, in run_until_complete
    return future.result()
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/client.py", line 1173, in execute_command
    return await conn.retry.call_with_retry(
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/retry.py", line 55, in call_with_retry
    await fail(error)
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/client.py", line 1162, in _disconnect_raise
    raise error
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/retry.py", line 52, in call_with_retry
    return await do()
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/client.py", line 1151, in _send_command_parse_response
    await conn.send_command(*args)
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/connection.py", line 878, in send_command
    await self.send_packed_command(
  File "/home/***/.local/lib/python3.10/site-packages/aioredis/connection.py", line 867, in send_packed_command
    raise ConnectionError(
aioredis.exceptions.ConnectionError: Error 103 while writing to socket. Software caused connection abort.

Now I'm a bit lost. How does a disconnect involve a send_packed_command? Or am I reading the trace incorrectly?

Andrew-Chen-Wang commented 2 years ago

Take a look at what Retry is set to in your connection object: https://github.com/aio-libs/aioredis-py/blob/cb93ef4a6fa80f3d32d78aced58d8df196aa58e1/aioredis/connection.py#L629

In the stack trace, it did fail. My guess is you had retry_on_timeout implemented, so you got 1 retry. Retry performed the send command function again. I guess what I don't understand is why the exception still wasn't caught. Maybe the Retry class also needs to catch OSError? Probably using a debugger would be helpful to check whether the disconnected connection was being reused.

Enchufa2 commented 2 years ago

Were you able to reproduce the error? If not, I think we should focus on that first. :) You need to find a way to reset or abort a TCP connection on macOS, or otherwise test this on a Linux box, where you can use ss, which aborts the connection. AFAIK, there's no equivalent on macOS. But there's the tcpkill utility, from dsniff, which seems to work on macOS too (at least this fork).

I just tried tcpkill myself on Linux and it works nicely by resetting the connection (exactly the same issue I've found in production). There are two things you need to do for this to work. First, launch tcpkill:

$ sudo tcpkill -i lo port 6379

Here tcpkill keeps listening on the loopback interface and will reset any TCP connection to port 6379 (if your Redis host is not local, just change the interface accordingly). Second, you need to force an IPv4 connection (because tcpkill doesn't understand IPv6). Again, for localhost you just need to specify redis = Redis(host="127.0.0.1") in my test case above.

cjdsellers commented 2 years ago

Seems like you guys are really close to a solution on this? Thank you for all the effort here!

steve-marmalade commented 2 years ago

Now that Aioredis is in redis-py 4.2.0rc1, is the expectation that this issue will go away once we migrate to redis? Any idea when a non-rc version will be released? Thanks for all your hard work!

Olegt0rr commented 2 years ago

Still happening:

File "aioredis/client.py", line 1082, in execute_command
    conn = self.connection or await pool.get_connection(command_name, **options)
  File "aioredis/connection.py", line 1422, in get_connection
    if await connection.can_read():
  File "aioredis/connection.py", line 893, in can_read
    return await self._parser.can_read(timeout)
  File "asyncio/streams.py", line 665, in read
    raise self._exception
ConnectionResetError: [Errno 104] Connection reset by peer

adrienyhuel commented 2 years ago

@Enchufa2 OK, I didn't think it was that, as we implemented all of call_with_retry, but, as it turns out, in the port we forgot to await the fail/disconnect call in call_with_retry... It's now fixed in #1156. Could you please test it one more time for me? Thank you so much! (I've also implemented redis/redis-py#1832 in #1156.)

@Andrew-Chen-Wang I see that those fixes are in pull requests in aioredis, but not in redis-py asyncio. Will this be ported too? Currently I'm stuck with aioredis 1.3.1 due to ConnectionResetError :(

Andrew-Chen-Wang commented 2 years ago

I've lost a severe amount of time to work, so I'm not sure when it'll be ported over. Maybe sometime in June, unless someone else would like to help out. Anyone can help!

steve-marmalade commented 2 years ago

Hi @Andrew-Chen-Wang could you point me in the direction of the changes that need to be ported over? We have currently disabled connection pooling due to the stability issues, and it definitely has introduced a performance overhead.

Andrew-Chen-Wang commented 2 years ago

Hi @steve-marmalade, please head over to redis/redis-py to ask this question. I'm fairly busy lately, but I'll try to answer when I find the time. Sorry!

Edit: sorry! I should have looked at this more carefully.

Andrew-Chen-Wang commented 2 years ago

@chayim do you think you can port over the PR that fixed this issue? Thanks!

steve-marmalade commented 2 years ago

Gentle bump @Andrew-Chen-Wang, @chayim