L235 / hat-collector

GNU General Public License v2.0
0 stars 4 forks source link

Bot does not recover from connection reset error #17

Closed L235 closed 2 years ago

L235 commented 2 years ago

Describe the bug When the bot experiences a ping timeout/connection reset it should exit so it can be restarted. However it just keeps running, with the application still running but not on IRC.

To reproduce Steps to reproduce the behavior:

  1. Wait for the bot to experience a ping timeout.

Expected behavior The application should quit, allowing the daemon to restart it.

Console output

WARNING:ReportBot:libera:>> Receive timeout reached, sending ping to check connection state...
ERROR:ReportBot:libera:Failed to execute on_raw_ping handler.
Traceback (most recent call last):
  File "/data/project/hat-collector/.local/share/virtualenvs/hat-collector-beta-Vi1f6vgq/lib/python3.9/site-packages/pydle/client.py", line 422, in on_raw
    await handler(message)
  File "/data/project/hat-collector/.local/share/virtualenvs/hat-collector-beta-Vi1f6vgq/lib/python3.9/site-packages/pydle/features/rfc1459/client.py", line 703, in on_raw_ping
    await self.rawmsg('PONG', *message.params)
  File "/data/project/hat-collector/.local/share/virtualenvs/hat-collector-beta-Vi1f6vgq/lib/python3.9/site-packages/pydle/client.py", line 312, in rawmsg
    await self._send(message)
  File "/data/project/hat-collector/.local/share/virtualenvs/hat-collector-beta-Vi1f6vgq/lib/python3.9/site-packages/pydle/client.py", line 361, in _send
    await self.connection.send(input)
  File "/data/project/hat-collector/.local/share/virtualenvs/hat-collector-beta-Vi1f6vgq/lib/python3.9/site-packages/pydle/connection.py", line 112, in send
    await self.writer.drain()
  File "/usr/lib/python3.9/asyncio/streams.py", line 387, in drain
    await self._protocol._drain_helper()
  File "/usr/lib/python3.9/asyncio/streams.py", line 190, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost
ERROR:pydle.client:Encountered error on socket.
pydle.features.rfc1459.protocol.ServerError: Closing Link: wikimedia/bot/hat-collector (Ping timeout: 268 seconds)

Environment:

Additional context Add any other context about the problem here.

L235 commented 2 years ago

Honestly, if on_data_error would just quit the process that would be satisfactory.

PhantomTech commented 2 years ago

The issue's branch should update the bot to try to reconnect for any error that causes a disconnection and quit if it isn't able to reconnect or if the error is unrelated to a disconnection. I haven't tested the quitting part but in my test it does handle ping timeouts by attempting to reconnect.

L235 commented 2 years ago

Hm, I'm not sure this is working. On 206f4cc409d8e0fe4d40ba8b9f147a849c286f81, here is what happens if I use NickServ to GHOST the bot running this branch (to simulate an unexpected quit):

ERROR:pydle.client:Unexpected disconnect. Attempting to reconnect within 5 seconds.
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-4' coro=<BasicClient.handle_forever() done, defined at /home/kevin/.local/share/virtualenvs/hat-collector-WgNW31Vi/lib/python3.9/site-packages/pydle/client.py:363> exception=SSLError(1, '[SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2756)')>
Traceback (most recent call last):
  File "/home/kevin/.local/share/virtualenvs/hat-collector-WgNW31Vi/lib/python3.9/site-packages/pydle/client.py", line 367, in handle_forever
    data = await self.connection.recv(timeout=self.READ_TIMEOUT)
  File "/home/kevin/.local/share/virtualenvs/hat-collector-WgNW31Vi/lib/python3.9/site-packages/pydle/connection.py", line 115, in recv
    return await asyncio.wait_for(self.reader.readline(), timeout=timeout)
  File "/usr/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/usr/lib/python3.9/asyncio/streams.py", line 540, in readline
    line = await self.readuntil(sep)
  File "/usr/lib/python3.9/asyncio/streams.py", line 632, in readuntil
    await self._wait_for_data('readuntil')
  File "/usr/lib/python3.9/asyncio/streams.py", line 517, in _wait_for_data
    await self._waiter
  File "/usr/lib/python3.9/asyncio/sslproto.py", line 528, in data_received
    ssldata, appdata = self._sslpipe.feed_ssldata(data)
  File "/usr/lib/python3.9/asyncio/sslproto.py", line 206, in feed_ssldata
    self._sslobj.unwrap()
  File "/usr/lib/python3.9/ssl.py", line 948, in unwrap
    return self._sslobj.shutdown()
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2756)

The application does not quit and the bot does not reconnect:

00:41 *** hats_alpha (~hats_alph@wikimedia/bot/hat-collector) has quit (Killed (NickServ (GHOST command used by hats_beta!~hats_beta@wikimedia/bot/hat-collector)))

(no reconnect)

PhantomTech commented 2 years ago

I'm not able to reproduce this. On 206f4cc409d8e0fe4d40ba8b9f147a849c286f81 this is what I get when I start the bot, wait for it to connect, then ghost it:

INFO:root:Connecting to DB
INFO:root:Preparing bot
INFO:root:Running bot
INFO:root:Syncing report channels
INFO:root:Running query: SELECT name FROM channels
INFO:root:Syncing rules
INFO:root:Running query: SELECT wiki, type, pattern, channel, ignore FROM rules ORDER BY ignore DESC
ERROR:pydle.client:Unexpected disconnect. Attempting to reconnect within 5 seconds.
INFO:root:Syncing report channels
INFO:root:Running query: SELECT name FROM channels
INFO:root:Syncing rules
INFO:root:Running query: SELECT wiki, type, pattern, channel, ignore FROM rules ORDER BY ignore DESC

The reconnection isn't even handled through part of this repos code, it's done in pydle's on_disconnect method here: https://github.com/Shizmob/pydle/blob/286e53363e72e8c10f09ee60c51f48ac5a9d8785/pydle/client.py#L321-L340

It looks like during your reconnection an error is thrown at https://github.com/Shizmob/pydle/blob/286e53363e72e8c10f09ee60c51f48ac5a9d8785/pydle/client.py#L367 but then it's just ignored. I'm not sure if the ignoring is pydle's fault or ours but I am able to reproduce that by making a !command throw an exception.

Overriding on_disconnect and having it do nothing might be a workaround for this issue but figuring out why exceptions are being ignored or why we're experiencing different behavior might lead to a better solution.

PhantomTech commented 2 years ago

I think I forgot to mention I made a commit b9eae087f9cb9ca6c733cb3c569e1d9710f5316c that might fix this. The change should prevent the bot from attempting to reconnect after an unexpected disconnection which will (hopefully) end the process and allow it to be restarted by a watchdog. I don't think I tested this at all myself since I wasn't able to reproduce the original error.

L235 commented 2 years ago

I think I forgot to mention I made a commit b9eae08 that might fix this. The change should prevent the bot from attempting to reconnect after an unexpected disconnection which will (hopefully) end the process and allow it to be restarted by a watchdog. I don't think I tested this at all myself since I wasn't able to reproduce the original error.

b9eae087f9cb9ca6c733cb3c569e1d9710f5316c likely fixed any remaining issues. I haven't noticed any unresolved disconnects. Thanks very much; closing.