eclipse / paho.mqtt.python

paho.mqtt.python
Other
2.19k stars 721 forks source link

Listener thread dies with uncaught thread exception in home-assistant integration #704

Closed AdrianGarside closed 9 months ago

AdrianGarside commented 1 year ago

Seeing the listener thread die. Seems to be associated either with bad data from the mqtt server or, less likely IMO, mishandling of legitimate data:

In this repro (matched to the uncaught thread exception stack below), I got handed incomplete data which triggered a json decode exception in my code which I caught and logged. But the thread still died: 2023-02-10 21:52:06.922 DEBUG (Thread-5 (_thread_main)) [custom_components.bambu_lab.pybambu] An exception occurred: 2023-02-10 21:52:06.922 DEBUG (Thread-5 (_thread_main)) [custom_components.bambu_lab.pybambu] Type: <class 'json.decoder.JSONDecodeError'> 2023-02-10 21:52:06.922 DEBUG (Thread-5 (_thread_main)) [custom_components.bambu_lab.pybambu] Args: ('Unterminated string starting at: line 1 column 22 (char 21)',) 2023-02-10 21:52:06.923 ERROR (Thread-5 (_thread_main)) [root] Uncaught thread exception

But some repros I get bad data and log the exception and the thread doesn't die. And in some repros my exception logging never fired at all and all I see is the uncaught thread exception log output.

For the uncaught thread exception (matched to the top repro) I found this logging from home assistant:

Logger: root Source: /usr/src/homeassistant/homeassistant/bootstrap.py:342 First occurred: February 10, 2023 at 9:52:06 PM (1 occurrences) Last logged: February 10, 2023 at 9:52:06 PM

Uncaught thread exception Traceback (most recent call last): File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/lib/python3.10/threading.py", line 953, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main self.loop_forever(retry_first_connection=True) File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever rc = self._loop(timeout) File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1164, in _loop rc = self.loop_read() File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1556, in loop_read rc = self._packet_read() File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 2439, in _packet_read rc = self._packet_handle() File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3033, in _packet_handle return self._handle_publish() File "/usr/local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3283, in _handle_publish (topic, packet) = struct.unpack(pack_format, packet) struct.error: bad char in struct format

Since this exception silently takes out the listener thread I'm left with no incoming data and I'm looking for a way I can detect and recover from this state. Repro rate is random but typically happens within 1-10 hours of runtime.

kocsis-antal commented 1 year ago

Same here:

2023-05-04 22:53:59.589 ERROR (Thread-113 (_thread_main)) [root] Uncaught thread exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
...
MattBrittan commented 9 months ago

I'm going to close this in favour of #113 ("loop_forever should really loop_forever") which I believe is really the same issue. We need to improve clarity around when loop_forever could stop (particularly around exceptions).

Note: This is part of an exercise to clean up old issues so that the project can move forwards. Due to the number of issues being worked through mistakes will be made; please feel free to reopen this issue (or comment) if you believe it's been closed in error.

AdrianGarside commented 9 months ago

@MattBrittan 113 mentioned documenting it could throw exceptions but IIRC this case properly the problem was I couldn't catch the exception to recover. But... this hasn't been causing problems for some time now so either the source mqtt server behavior that caused it got fixed or I did find a workaround.

MattBrittan commented 9 months ago

was I couldn't catch the exception to recover.

You can add a try/catcharoundloop_forever` (see this).

But, unfortunately, this has an issue for some time and changes in python itself have had an impact.. So hopefully it's fixed for you.

The aim is document what exceptions (if any) can escape loop_forever (hopefully the vast majority will be caught!).

AdrianGarside commented 9 months ago

IIRC that's exactly what I had and it didn't help. The stack above has none of my code on it and the exception never got raised on my thread. But it's not been happening for some time now so something must have coincidentally changed. Yes, re-reading what I wrote - I had an exception handler that logged any unknown/unexpected exceptions - it never triggered in this scenario. I just got left with a dead mqtt session and no way to know.