Remove MQTT 3.1 fallback

dtwood commented 2 years ago

Describe the bug When a MQTT connection is negotiated with the server using protocol version 3.1, and then the connection times out, an attempt to connect will be done with protocol version 3. The application is also notified that the connection has terminated, and will begin its own reconnection attempt.

These two connections can race, resulting in the onSuccess/onFailure callback being called multiple times.

We're working around this in our application by setting MQTTVersion = MQTTVERSION_3_1_1 and only having a single item in the list of server URLs.

To Reproduce I'm not quite sure - I can reproduce this on our system by disconnecting the radio link, and reconnecting it 5 seconds later. But I don't seem to be able to do that on a machine with a wired network connection. I think that the network interface needs to not go down, so maybe this can be reproduced with an Ethernet interface connected to a switch, and then disconnecting the uplink from that switch briefly? But here are logs, interleaved MQTT tracing and tracing of the paho.mqtt.rust logs:

20211112 154227.812 m->c->connect_state = 0
20211112 154227.812 SSLSocket error (5) in SSL_read - getch for socket 13 rc -1 errno 110 Connection timed out
20211112 154227.812 Error from MQTTAsync_cycle() - removing socket 13
20211112 154227.812 sent 0 256 buflen 5
20211112 154231.677 sent 772 257 buflen 1
20211112 154231.677 SSLSocket error (5) in SSL_write for socket 13 rc -1 errno 32 Broken pipe
20211112 154231.677 13 projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device -> DISCONNECT (-1)
20211112 154231.677 Removed socket 13
20211112 154231.677 Reset max fdp1 to 1
20211112 154231.677 Calling connectionLost for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
[2021-11-12T15:42:31Z WARN  paho_mqtt::async_client] Connection lost. Context: 0xb7257778
[2021-11-12T15:42:31Z INFO  oamd::mqtt] MQTT connection lost callback
20211112 154231.677 Connect failed, more to try
20211112 154231.677 Connecting to serverURI mqtt.2030.ltsapis.goog:8883 with MQTT version 3
[2021-11-12T15:42:32Z ERROR oamd::mqtt] Failed to publish MQTT message: Elapsed(())
20211112 154231.677 New socket 13 for mqtt.2030.ltsapis.goog:8883, port 8883
20211112 154231.677 Connect pending
[2021-11-12T15:42:36Z DEBUG oamd::mqtt] Connecting to MQTT broker
20211112 154231.677 Return code 1 from read select
20211112 154231.677 Return code 1 from write select
20211112 154231.677 m->c->connect_state = 1
20211112 154236.798 SSL handshake started write:unknown:unknown
20211112 154236.798 SSL connect:SSLv3/TLS write client hello
20211112 154236.812 m->c->connect_state = 2
20211112 154236.812 received 0 256 buflen 5
20211112 154236.812 SSL state connect:SSLv3/TLS write client hello:(NONE)
20211112 154236.908 SSL state connect:SSLv3/TLS write finished:TLS_AES_256_GCM_SHA384
20211112 154236.908 SSL handshake done write:unknown:unknown
20211112 154236.908 SSL certificate verification: X509_V_OK
20211112 154236.908 SSL connect:SSL negotiation finished successfully
20211112 154236.908 peername from X509_check_host is *.2030.ltsapis.goog
20211112 154236.908 13 projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device -> CONNECT version 3 clean: 1 (0)
[2021-11-12T15:42:36Z INFO  oamd::mqtt] About to connect to MQTT
[2021-11-12T15:42:36Z DEBUG paho_mqtt::async_client] Connecting handle: 0xb7239c10
[2021-11-12T15:42:36Z DEBUG paho_mqtt::async_client] Connect options: ConnectOptions […]
20211112 154236.946 SSL alert write:warning:close notify
20211112 154236.946 Removed socket 13
20211112 154236.946 Reset max fdp1 to 1
20211112 154236.946 0 inflight messages deleted for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
20211112 154236.946 Calling connect failure for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Token failure! 0xb7265010, 0xb55fec14
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Token failure message: "TCP connect timeout"
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Token completed with code: -1
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Expecting server response for: Connect
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Got response: ServerResponse [...]
20211112 154236.946 Connecting to serverURI mqtt.2030.ltsapis.goog:8883 with MQTT version 4
[2021-11-12T15:42:36Z WARN  oamd::mqtt] Failed to connect to MQTT broker: PahoDescr(-1, "TCP connect timeout")
20211112 154236.946 New socket 13 for mqtt.2030.ltsapis.goog:8883, port 8883
20211112 154236.946 Connect pending
[2021-11-12T15:42:37Z DEBUG oamd::tcpsocket] Sending message Request(11, Echo)
20211112 154236.946 Return code 0 from read select
20211112 154236.946 Return code 1 from write select
20211112 154236.946 m->c->connect_state = 1
20211112 154237.952 SSL connect:SSLv3/TLS write client hello
20211112 154237.952 m->c->connect_state = 2
20211112 154237.952 SSL state connect:SSLv3/TLS write client hello:(NONE)
20211112 154238.017 SSL state connect:SSLv3/TLS write finished:TLS_AES_256_GCM_SHA384
20211112 154238.017 SSL handshake done write:unknown:unknown
20211112 154238.017 SSL certificate verification: X509_V_OK
20211112 154238.017 SSL connect:SSL negotiation finished successfully
20211112 154238.017 peername from X509_check_host is *.2030.ltsapis.goog
20211112 154238.031 13 projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device -> CONNECT version 4 clean: 1 (0)
[2021-11-12T15:42:38Z INFO  oamd::mqtt] Finished connecting to MQTT broker
20211112 154238.031 m->c->connect_state = 4
20211112 154238.031 SSL connect:SSL negotiation finished successfully
20211112 154238.960 13 projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device <- CONNACK rc: 0
20211112 154238.960 0 inflight messages deleted for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
20211112 154238.960 0 responses removed for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
20211112 154238.960 Calling connect success for client projects/[REDACTED]/locations/us-central1/registries/development-devices/devices/dtw-development-device
[2021-11-12T15:42:38Z DEBUG paho_mqtt::token] Token success! 0xb7265010, 0xb4dfec48
[2021-11-12T15:42:38Z DEBUG paho_mqtt::token] Token completed with code: 0

The application receives the connection lost callback, and begins the reconnection:

[2021-11-12T15:42:31Z WARN  paho_mqtt::async_client] Connection lost. Context: 0xb7257778
[2021-11-12T15:42:31Z INFO  oamd::mqtt] MQTT connection lost callback
20211112 154231.677 Connect failed, more to try
[...]
[2021-11-12T15:42:36Z DEBUG oamd::mqtt] Connecting to MQTT broker
[...]
20211112 154236.946 Connecting to serverURI mqtt.2030.ltsapis.goog:8883 with MQTT version 4

However, paho.mqtt.c is continuing to attempt to connect to the server with MQTTVERSION_3_1, and there are two connection attempts in the queue, with the same token:

20211112 154231.677 Connect failed, more to try
20211112 154231.677 Connecting to serverURI mqtt.2030.ltsapis.goog:8883 with MQTT version 3

This then causes double delivery of the token at address 0xb7239c10 - the first attempt (the real connection requested by the application) fails, and the application deallocates the memory there. Then the second token is dereferenced, causing the application to attempt to lock a mutex stored in deallocated memory:

[2021-11-12T15:42:36Z DEBUG paho_mqtt::async_client] Connecting handle: 0xb7239c10
[2021-11-12T15:42:36Z DEBUG paho_mqtt::token] Token failure! 0xb7265010, 0xb55fec14
[2021-11-12T15:42:38Z DEBUG paho_mqtt::token] Token success! 0xb7265010, 0xb4dfec48

#0  0xb6561524 in pthread_mutex_lock () from /lib/libpthread.so.0
#1  0xb663bd8c in pthread_mutex_lock () from /lib/libc.so.6
#2  0xb6928d30 in paho_mqtt::token::TokenInner::on_complete::hb89385dc32010d2b ()
#3  0xb6927d20 in paho_mqtt::token::TokenInner::on_success::h1aa35036e56d42c5 ()
#4  0xb693724c in MQTTAsync_receiveThread ()
#5  0xb655fed2 in ?? () from /lib/libpthread.so.0
#6  0xb655fed2 in ?? () from /lib/libpthread.so.0

Expected behavior The onSuccess ond onFailure callback are only called once in total for any token.

Environment

Linux 32-bit, Arm and PowerPC
paho.mqtt.c 1.3.9 with Rust paho-mqtt 0.9.1

icraggs commented 2 years ago

I'm not following the sequence of events here. The relevant parts of the trace are what happens in the lead up to calling connection lost, and then the reconnect attempts. If you can attach them as a file rather than pasting the contents that would be great.

I think you should only get connectionLost called after both MQTT 3.1.1 and 3.1 have been tried.

I presume you're not using auto-reconnect, otherwise calling connect from connectionLost as well will conflict. It should be one or the other.

davidthomaswood commented 2 years ago

I'm afraid I can't currently reproduce the crash, but here's full logs of a session which connects to a MQTT 3.1.1-compatible server, the connection is then interrupted, and the library reconnects using MQTT 3.1.

The connection being interrupted is done by running sudo ss -o state established '( dport = :8883 or dport = :1883 )' --kill from another command line window.

I've put the Rust source code for the application in the attached ZIP file, but it's a simple application that just:

Creates an AsyncClient, with automatic reconnection disabled (the default)
Connects to test.mosquitto.org on port 1883
Prints out each message received from the server (none expected), until the connection is interrupted
Sleeps for 10 seconds, to allow the library to finish the unexpected reconnect.

davidthomaswood commented 2 years ago

I think that the crash we see in our application is a result of a race condition between the unexpected reconnection performed by the library, and a new connection from our application. One of those connection attempts succeeds, and the second then causes the first to disconnect (as Google Core IoT only allows a single connection for each device). Those two callbacks both get delivered with the same context pointer. Unfortunately, it seems to be very difficult to reproduce - even after setting traffic control rules to increase the latency with tc and interrupting the main TCP connection with ss, I'm still not able to reproduce the crash off-target.

However, after setting MQTTVersion = MQTTVERSION_3_1_1 in our main application a few months ago, it does not seem to have occurred again. So I think that just removing the unexpected reconnection should also resolve this crash.

icraggs commented 1 year ago

Ok. Removing the fallback to MQTT 3.1 in the next major version of the library seems like the best thing to do. It's hardly going to be needed these days anyway.

eclipse-paho / paho.mqtt.c

Remove MQTT 3.1 fallback #1187