Gracefully handle Slack websocket closing

torgeirl commented 4 years ago

I am...

[ ] Reporting a bug
[x] Suggesting a new feature
[ ] Requesting help with running my bot
[ ] Requesting help writing plugins
[ ] Here about something else

I am running...

Errbot version: 6.1.8 OS version: Debian (python:3.9-slim container) Python version: 3.9 Using a virtual environment: no (using Docker instead) Backend: Slack

Issue description

The Slack RTM websocket closing (Slack goodbye event) causes a Python error. This Python error should be handled more gracefully (ie using some try/except), and instead log a warning when the websocket is closed and reopened.

Steps to reproduce

Run a Slack bot using the master branch, and leave it running for a few hours. After 3-12hr you should get a Python error due to the Slack RTM websocket closed.

Workaround

Switch to the slackv3 backend.

Additional info

2021-06-25 22:19:23,760 ERROR    errbot.backends.slack     Error reading from RTM stream:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/slackclient/server.py", line 283, in websocket_safe_read
    data += "{0}\n".format(self.websocket.recv())
  File "/usr/local/lib/python3.9/site-packages/websocket/_core.py", line 310, in recv
    opcode, data = self.recv_data()
  File "/usr/local/lib/python3.9/site-packages/websocket/_core.py", line 327, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/usr/local/lib/python3.9/site-packages/websocket/_core.py", line 340, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/local/lib/python3.9/site-packages/websocket/_core.py", line 374, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/local/lib/python3.9/site-packages/websocket/_abnf.py", line 361, in recv_frame
    self.recv_header()
  File "/usr/local/lib/python3.9/site-packages/websocket/_abnf.py", line 309, in recv_header
    header = self.recv_strict(2)
  File "/usr/local/lib/python3.9/site-packages/websocket/_abnf.py", line 396, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/usr/local/lib/python3.9/site-packages/websocket/_core.py", line 449, in _recv
    return recv(self.sock, bufsize)
  File "/usr/local/lib/python3.9/site-packages/websocket/_socket.py", line 93, in recv
    raise WebSocketConnectionClosedException(
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/errbot/backends/slack.py", line 442, in serve_once
    for message in self.sc.rtm_read():
  File "/usr/local/lib/python3.9/site-packages/slackclient/client.py", line 235, in rtm_read
    json_data = self.server.websocket_safe_read()
  File "/usr/local/lib/python3.9/site-packages/slackclient/server.py", line 300, in websocket_safe_read
    raise SlackConnectionError(
slackclient.server.SlackConnectionError: Unable to send due to closed RTM websocket
2021-06-25 22:19:23,766 INFO     errbot.core               Disconnect callback, deactivating all the plugins.
2021-06-25 22:19:23,775 INFO     errbot.backends.base      Reconnecting in 1 seconds (0 attempted reconnections so far).
2021-06-25 22:19:24,777 INFO     errbot.backends.slack     Verifying authentication token
2021-06-25 22:19:25,107 INFO     errbot.backends.slack     Connecting to Slack real-time-messaging API
2021-06-25 22:19:25,802 INFO     errbot.backends.slack     Connected
2021-06-25 22:19:25,955 INFO     errbot.core               Activate internal commands
2021-06-25 22:19:25,955 INFO     errbot.plugin_manager     Activate bot plugins...
2021-06-25 22:19:25,955 INFO     errbot.plugin_manager     Activate plugin: Help.
2021-06-25 22:19:25,957 INFO     errbot.core_plugins.wsview Checking Help for webhooks
2021-06-25 22:19:25,959 INFO     errbot.plugin_manager     Activate plugin: CommandNotFoundFilter.
2021-06-25 22:19:25,962 INFO     errbot.core_plugins.wsview Checking CommandNotFoundFilter for webhooks
2021-06-25 22:19:25,963 INFO     errbot.plugin_manager     Activate plugin: Utils.
2021-06-25 22:19:25,964 INFO     errbot.core_plugins.wsview Checking Utils for webhooks
2021-06-25 22:19:25,964 INFO     errbot.plugin_manager     Activate plugin: Plugins.
2021-06-25 22:19:25,967 INFO     errbot.core_plugins.wsview Checking Plugins for webhooks
2021-06-25 22:19:25,967 INFO     errbot.plugin_manager     Activate plugin: ACLs.
2021-06-25 22:19:25,969 INFO     errbot.core_plugins.wsview Checking ACLS for webhooks
2021-06-25 22:19:25,969 INFO     errbot.plugin_manager     Activate plugin: Health.
2021-06-25 22:19:25,971 INFO     errbot.core_plugins.wsview Checking Health for webhooks
2021-06-25 22:19:25,972 INFO     errbot.plugin_manager     Activate plugin: Tradingpost.
2021-06-25 22:19:25,973 INFO     errbot.core_plugins.wsview Checking Tradingpost for webhooks
2021-06-25 22:19:25,977 INFO     errbot.plugin_manager     Activate plugin: Random.
2021-06-25 22:19:25,978 INFO     errbot.core_plugins.wsview Checking Random for webhooks
2021-06-25 22:19:25,978 INFO     errbot.core               Notifying connection to all the plugins...
2021-06-25 22:19:25,979 INFO     errbot.core               Plugin activation done.

nzlosh commented 4 years ago

Are you using plain bot scope or the new granular scopes?

torgeirl commented 4 years ago

Are you using plain bot scope or the new granular scopes?

I'm using the old, plain bot scope. How would that effect the gracefulness of this error? :confused:

nzlosh commented 4 years ago

I need to understand your setup fully so that I can attempt to replicate the issue you're having.

nzlosh commented 4 years ago

I was able to reproduce the issue. I'll dig into the code to see what can be done to handle this case cleanly.

nzlosh commented 4 years ago

I've got a tentative solution here https://github.com/errbotio/errbot/pull/1463. Would you be able to test it on your setup @torgeirl ?

torgeirl commented 4 years ago

I've got a tentative solution here #1463. Would you be able to test it on your setup @torgeirl ?

@nzlosh: I tested it out, but it didn't change anything. :disappointed:

It does however look like the changes are made for the SlackRTM backend, and not the regular Slack backend? Both backends reference RTM streams, but this particular error occurs in errbot/backends/slack.py. Would this solution also work for the regular Slack backend?

(It was my understanding that the SlackRTM backend is beta so once we got the regular Slack backend working again last month I switch back to it.)

nzlosh commented 4 years ago

You're quite right, I mistook the RTM stream errors to be related to the slack_rtm backend. The issue is the same between both backends, but the code is not the same, so the changes I've made won't apply to slack. I'll take a look at patching the slack backend when I get some free time.

torgeirl commented 4 years ago

OK. Thanks for taking the time to look into this!

feiyang233 commented 3 years ago

@torgeirl I also got this issue, because reaching the maximum duration of a RTM web socket connection (8 hours), slack close the RTM connection. https://api.slack.com/events/goodbye

Do you have any update for this one? Thanks

nzlosh commented 3 years ago

Would you mind trying this backend https://github.com/nzlosh/err-backend-slackv3 and letting me know if you encounter the problem.

feiyang233 commented 3 years ago

I using errbotv5.2 and slackclientv1.3.1, because our slack is huge, so rtm.start is naturally more difficult to use with Enterprise Grid and other large workspaces.

We often got the rtm_read error, after troubleshooting, we find out the bottleneck is CPU, old instance type is t3a.medium, which is often crash at errbot restart and need to wait for 6 mins to pull information from slack.

After the change to c5.large, I test 3 times restart just now, no crash anymore, faster to pull information from slack, only 4 mins

drahamim commented 3 years ago

I have been receiving the same error message in 2 different bots. (One running locally and one running in a k8s cluster) Utilizing the 'Slack' backend. Unlimited mem/cpu errbot 6.1.7 slackclient 1.3.2

should i be using Slack_rtm instead?

nzlosh commented 3 years ago

@drahamim The websocket closing issue was resolved on the slackv3 if you'd like to try it out.

torgeirl commented 3 years ago

Updated the opening post with my current environment and included the slackv3 backend as a workaround to help users finding this issue when looking for answers.

I do hope this issue either can be fixed for for the slack backend or that the slackv3 backend becomes stable enough for inclusion in the main repository.

torgeirl commented 9 months ago

Can confirm the connection re-establishing itself is graceful handled as of Errbot v6.2.0 using the SlackV3 backend. Every few hours it logs the following info messages (INFO; not ERROR as before):

2024-01-13 21:08:57,031 INFO     slack_sdk.socket_mode.builtin.client The connection seems to be stale. Disconnecting... (session id: <old-session-ID>, reason: disconnected for 25+ seconds)
2024-01-13 21:08:59,230 INFO     slack_sdk.socket_mode.builtin.client The connection has been closed (session id: <old-session-ID>)
2024-01-13 21:08:59,230 INFO     slack_sdk.socket_mode.builtin.client The session seems to be already closed. Reconnecting... (session id: <old-session-ID>)
2024-01-13 21:08:59,231 INFO     slack_sdk.socket_mode.builtin.client Connecting to a new endpoint...
2024-01-13 21:08:59,902 INFO     slack_sdk.socket_mode.builtin.client The connection has been closed (session id: <old-session-ID>)
2024-01-13 21:08:59,902 INFO     slack_sdk.socket_mode.builtin.client A new session has been established (session id: <new-session-ID>)
2024-01-13 21:08:59,902 INFO     slack_sdk.socket_mode.builtin.client Connected to a new endpoint...
2024-01-13 21:09:00,052 INFO     slack_sdk.socket_mode.builtin.client Stopped receiving messages from a connection (session id: <old-session-ID>)
2024-01-13 21:09:00,153 INFO     slack_sdk.socket_mode.builtin.client Starting to receive messages from a new connection (session id: <new-session-ID>)
2024-01-13 21:09:00,154 INFO     errbot.core               Activate internal commands
2024-01-13 21:09:00,155 INFO     errbot.plugin_manager     Activate bot plugins...

The graceful handling is immediately followed by some errors however, but that seems to be caused by the CORE_PLUGINS issues explained in #1591:

INFO     errbot.core               Activate internal commands
INFO     errbot.plugin_manager     Activate bot plugins...
ERROR    errbot.plugin_manager     Error loading VersionChecker.
(...)
AttributeError: 'NoneType' object has no attribute 'is_activated'
ERROR    errbot.plugin_manager     Error loading TextCmds.
(etc, etc, etc)

errbotio / errbot