eclipse / paho.mqtt.python

paho.mqtt.python
Other
2.15k stars 724 forks source link

SSL EOF leads to persistent non-delivery of messages #750

Closed gdt closed 6 months ago

gdt commented 10 months ago

I am using 1.6.1, NetBSD 9 amd64, python 3.10. The broker is up-to-date mosquitto, on a remote NetBSD 9 amd64 system. I am using 8883 with a real certificate, and generally everything works well.

I have a script which polls a UPS and publishes messages usually once a minute, but at intervals of up to 1s if something interesting has happened. On that system, I took the WAN interface down, releasing the lease, removing addresses and default route, and ifconfig down, waited an hour and brought it back. Obviously I won't have data in the meantime but I expect it to recover.

In the log (stdout/stderr) of the program I see (with the json shortened; there are actually 14 keys):

SUBMITTED time {'time': 1696117812.470824, 'runtime': 5940.0}
PUBLISHED 16075
SUBMITTED time {'time': 1696117873.4957478, 'runtime': 5940.0}
Exception in thread Thread-1 (_thread_main):
Traceback (most recent call last):
  File "/usr/pkg/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/pkg/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main
    self.loop_forever(retry_first_connection=True)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever
    rc = self._loop(timeout)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1181, in _loop
    rc = self.loop_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1577, in loop_write
    rc = self._packet_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 2464, in _packet_write
    write_length = self._sock_send(
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 649, in _sock_send
    return self._sock.send(buf)
  File "/usr/pkg/lib/python3.10/ssl.py", line 1206, in send
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)
PUBLISHED 16076
SUBMITTED time {'time': 1696117934.524036, 'runtime': 5940.0}
SUBMITTED time {'time': 1696117994.8527699, 'runtime': 5940.0}

So the "EOF" looks like write failed. Probably this is "no route to host". However I don't get a disconnect callback. And I did get a publish callback! But surely the broker didn't get the message. Then, future publish calls happen, with no exceptions. But no publish callbacks.

Obviously I can write a watchdog to close/open if I don't get pubacks to python.

But it seems that the ssl error should be caught and the connection should be judged non-functional. Or at least some later publish should cause failure and to close the connection. The user program should not get an exception thrown other than in response to an API call and then only documented exceptions.

I don't think this is related to threading/locking, as the program only did one publish call, and without the network being down, runs for months.

gdt commented 9 months ago

(I did write a watchdog, but calling close and then reopen did not result in messages flowing.)

Reading client.py, I see _packet_write calling _sock_send on line 2506. A number of exceptions are caught, but there is no general catc, and obviously the one above is not caught. It seems obvious that an exception that has not been considered carefully risks leaving the socket in a broken state, as happens to me. And therefore that other exceptions should lead to the close/retry step.

warped-rudi commented 6 months ago

Are there any updates on this issue? I'm seeing it as well when the internet connection gets lost temporarily.

PierreF commented 6 months ago

It seems you are using loop_start, and some exception aren't correctly handled which cause the loop thread to crash. Once crashed no more processing it done.

I've made a PR (#797) which should fix your problem.

gdt commented 6 months ago

Thanks for addressing this. I am in fact using loop_start. Modulo variables to set my username/etc. and to enable TLS, the code is at https://github.com/merbanan/rtl_433/blob/master/examples/rtl_433_mqtt_relay.py . But, aside from this failure, everything else is fine. I am grabbing the patch in the PR, will patch my system, and restart it.

warped-rudi commented 6 months ago

I gave this patch a try and first tests were positive. Thanks!