eclipse-paho / paho.mqtt.c

An Eclipse Paho C client library for MQTT for Windows, Linux and MacOS. API documentation: https://eclipse.github.io/paho.mqtt.c/
https://eclipse.org/paho
Other
1.98k stars 1.1k forks source link

Cannot shutdown paho gracefully if connection lost to the broker and there are still messages to publish #1474

Open smallSwed opened 6 months ago

smallSwed commented 6 months ago

Describe the bug I have a publisher publishing messages via MQTTAsync_sendMessage. If I initiate a shutdown but the connection was already lost to the broker, the MqttAsync_disconnect returns Success but never processed: no callback called (neither failure nor success) My disconnect wait will timeout. So I call the MqttAsync_destroy which will leak.

To Reproduce Not always reproducible, needs certain conditions met: it needs some PUBLISH commands in the MQTTAsync_commands when the connection is lost:

  1. start broker (e.g. mosquitto)
  2. start client to publish messages via the Async API
  3. kill the broker
  4. try to shut down the client via MqttAsync_disconnect -> wait for callbacks -> MqttAsync_destroy sequence
  5. see logs for leak report

Expected behavior A proper cleanup for unsent but queued publish messages maybe: MqttAsync_disconnect would disconnect no matter if there are queued publish commands

  1. MqttAsync_destroy would clean up the unset messages

Log files memory_leak.log

Points of interest in the log file:

I had to clean the log files from some of the logs of my application, but all of the PAHO logs are included in the file.

Environment (please complete the following information):

Additional context One condition for this bug is the lost connection to the broker and no reconnect in a reasonable time.

The leaked data is the payload and the destination name string passed to MQTTAsync_sendMessage for all unsent messages.

I did some investigation and it's comes down to this:

  1. the reason for the disconnect command was not processed is the following: our MqttAsync_disconnect call will put a DISCONNECT command at the end of  MQTTAsync_commands MQTTAsync_processCommand will find a PUBLISH command for our client which is disconnected so it will be not processed and always stays at the start of the list. Our client will be put inside the ignore_clients so the DISCONNECT will be ignored at the end of the list.
  2. Reason for the leak: In MQTTAsync_destroy there is an assumption that all the payloads and destinationNames are stored in MQTTProtocol_storePublication so the MQTTAsync_NULLPublishCommands called , but for some reason those are not stored there or not freed from there.

The leaked mallocs are from MQTTAsync_send:

if ((pub->command.details.pub.destinationName = MQTTStrdup(destinationName)) == NULL)
...
if ((pub->command.details.pub.payload = malloc(payloadlen)) == NULL)

The call stack:

T.dll!mymalloc() - Line 201
T.dll!MQTTAsync_send() - Line 1277
T.dll!MQTTAsync_sendMessage() - Line 1317
T.dll!OTMqttCommunication::send() - Line 1301
icraggs commented 1 month ago

Thanks for the investigation.

There's no MQTTAsync_disconnect call showing in this log - there's an internal call. If you did call MQTTAsync_disconnect I would expect it to return "not connected" as does MQTTAsync_send once the connection is lost. Then there would be no success or failure callback.

I've put in a simple fix for the memory leak. I think there might be a case I'm missing with this, but the CI tests pass, and I haven't found any other problem yet.

I've recently increased the synchronization of API calls in the library, so this may rule out the circumstances which cause the disconnect call to be processed. I've not seen this myself in any testing I've done so far.

Both these changes are in the develop branch.