Open dtwood opened 2 years ago
I'm not following the sequence of events here. The relevant parts of the trace are what happens in the lead-up to connectionLost being called, and then the reconnect attempts. If you can attach them as a file rather than pasting the contents, that would be great.
I think you should only get connectionLost called after both MQTT 3.1.1 and 3.1 have been tried.
I presume you're not using auto-reconnect, otherwise calling connect from connectionLost as well will conflict. It should be one or the other.
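For anyone following along, here is a rough model of the fallback order described above. It is only a sketch, not the library's code: `MqttVersion`, `versions_to_try` and `connect_with_fallback` are made-up names, and the only thing taken from paho.mqtt.c is the documented behaviour that the default setting tries 3.1.1 first and then falls back to 3.1.

```rust
/// Hypothetical stand-in for the library's MQTTVersion setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MqttVersion {
    /// Default: try 3.1.1 first, then fall back to 3.1.
    Default,
    /// Pinned to MQTT 3.1.
    V31,
    /// Pinned to MQTT 3.1.1.
    V311,
}

/// Protocol versions a single connect request would attempt, in order.
pub fn versions_to_try(setting: MqttVersion) -> Vec<MqttVersion> {
    match setting {
        MqttVersion::Default => vec![MqttVersion::V311, MqttVersion::V31],
        pinned => vec![pinned],
    }
}

/// Runs `attempt` for each candidate version and returns the first one that
/// connects. `None` means every candidate failed, which is the only point at
/// which the application should hear about the failure.
pub fn connect_with_fallback<F>(setting: MqttVersion, mut attempt: F) -> Option<MqttVersion>
where
    F: FnMut(MqttVersion) -> bool,
{
    versions_to_try(setting).into_iter().find(|&v| attempt(v))
}
```

In this model, pinning the version leaves exactly one attempt per connect request, so there is no second attempt that could overlap with one started by the application.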
I'm afraid I can't currently reproduce the crash, but here are full logs of a session that connects to an MQTT 3.1.1-compatible server; the connection is then interrupted, and the library reconnects using MQTT 3.1.
The connection is interrupted by running `sudo ss -o state established '( dport = :8883 or dport = :1883 )' --kill` from another command-line window.
I've put the Rust source code for the application in the attached ZIP file, but it's a simple application that just uses an `AsyncClient`, with automatic reconnection disabled (the default).
I think that the crash we see in our application is a result of a race condition between the unexpected reconnection performed by the library and a new connection from our application. One of those connection attempts succeeds, and the second then causes the first to disconnect (as Google Cloud IoT Core only allows a single connection for each device). Those two callbacks both get delivered with the same context pointer. Unfortunately, it seems to be very difficult to reproduce: even after setting traffic control rules to increase the latency with `tc` and interrupting the main TCP connection with `ss`, I'm still not able to reproduce the crash off-target.
However, after setting `MQTTVersion = MQTTVERSION_3_1_1` in our main application a few months ago, it does not seem to have occurred again. So I think that just removing the unexpected reconnection should also resolve this crash.
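As an aside, one defensive pattern an application could use against a completion arriving twice for the same context is a once-only guard. The sketch below is an assumption about how that might look, not what our application or paho.mqtt.rust actually does; `Outcome` and `CompletionGuard` are hypothetical names. It only helps if the guard itself outlives both deliveries, which is why it sits behind an `Arc` here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Result of a connect attempt as reported to the application.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failure,
}

/// Hypothetical per-token state shared by every callback that carries the
/// same context.
pub struct CompletionGuard {
    delivered: AtomicBool,
}

impl CompletionGuard {
    /// Creates a guard that has not yet delivered anything.
    pub fn new() -> Arc<Self> {
        Arc::new(Self {
            delivered: AtomicBool::new(false),
        })
    }

    /// Returns `Some(outcome)` for the first completion and `None` for any
    /// later one, so the caller can ignore duplicates.
    pub fn deliver(&self, outcome: Outcome) -> Option<Outcome> {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if self.delivered.swap(true, Ordering::AcqRel) {
            None
        } else {
            Some(outcome)
        }
    }
}
```

The application would release the token's resources only on the path where `deliver` returns `Some`, and treat `None` as a duplicate to be dropped.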
Ok. Removing the fallback to MQTT 3.1 in the next major version of the library seems like the best thing to do. It's hardly going to be needed these days anyway.
Describe the bug
When an MQTT connection is negotiated with the server using protocol version 3.1.1, and the connection then times out, the library attempts to connect again with protocol version 3.1. The application is also notified that the connection has terminated, and will begin its own reconnection attempt.
These two connections can race, resulting in the `onSuccess`/`onFailure` callback being called multiple times.
We're working around this in our application by setting `MQTTVersion = MQTTVERSION_3_1_1` and only having a single item in the list of server URLs.
To Reproduce
I'm not quite sure. I can reproduce this on our system by disconnecting the radio link and reconnecting it 5 seconds later, but I don't seem to be able to do that on a machine with a wired network connection. I think that the network interface needs to not go down, so maybe this can be reproduced with an Ethernet interface connected to a switch, and then disconnecting the uplink from that switch briefly? Here are logs, interleaving MQTT tracing with the paho.mqtt.rust logs:
The application receives the connection lost callback, and begins the reconnection:
However, paho.mqtt.c is continuing to attempt to connect to the server with `MQTTVERSION_3_1`, and there are two connection attempts in the queue, with the same token:
This then causes double delivery of the token at address `0xb7239c10`: the first attempt (the real connection requested by the application) fails, and the application deallocates the memory there. Then the second token is dereferenced, causing the application to attempt to lock a mutex stored in deallocated memory:
Expected behavior
The `onSuccess` and `onFailure` callbacks are only called once in total for any token.
Environment