eclipse / paho.mqtt.c

An Eclipse Paho C client library for MQTT for Windows, Linux and MacOS. API documentation: https://eclipse.github.io/paho.mqtt.c/
https://eclipse.org/paho
Other
1.94k stars 1.09k forks source link

Reconnect problems at MQTT Publisher over WebSocket protocol, using QoS2 and persistent session #1401

Open leventemol opened 11 months ago

leventemol commented 11 months ago

Reconnect problems occur at MQTT publisher client implementation with MQTT broker connection over WebSocket protocol, using QoS2 topic types and persistent session.

Steps to reproduce the behavior:

  1. Create an MQTT publisher client connection to a broker: the connection is established, and publish messages are sent out by the client.
  2. Interrupt the connection, enabling a firewall rule to block the connection to specific remote ports.
  3. Following this the connection to the broker is lost. The failed publish count increments, in the onWriteFailure() callback.
  4. MQTTProtocol_startPublish() returns socket error in MQTTAsyncUtils::MQTTAsync_processCommand().
  5. After a specific time interval, disable the firewall rule to allow the client to connect back.
  6. When the Connect packet is sent out by the publisher, and the broker responds with Connect Acknowledge, the connection is attempted to be completed, such that all outbound publish messages are sent out first.
  7. Sending the outbound publish message for which the initial Publish failed, MQTTPacket_send_publish() returns socket error in MQTTProtocolClient::MQTTProtocol_retries(). This results a session close, and the connection does not complete.
  8. Following the reconnect attempts, the connection doesn't complete, connectionLost() callback is entered, updating the connection stop time, showing the time interval while outbound publish messages were sent out.

Parameter values used: Keep Alive Interval 50 s, Connect Timeout 30 s, Automatic Reconnect enabled, Minimum Retry Interval 1 s, Maximum Retry Interval 60 s.

Expected behaviour: After the connection is lost, the client connects back and sends all outbound publish messages.

Please find attached the log file for more information.

Environment: Windows 10 Pro log.txt

icraggs commented 11 months ago

Are there any messages on the broker log? It looks like the broker could be closing the connection, because of a situation it doesn't like. What is the broker?

leventemol commented 11 months ago

Hi! Have attached new log files from the Publisher and the Mosquitto Broker. There are some denied publish reports on the broker side.

publisher_log.txt mosquitto_broker_log.txt mosquitto_broker_log_extract.txt

icraggs commented 11 months ago

So Mosquitto is denying the publish and presumably closing the connection? The reason why Mosquitto is denying the publish needs to be identified. A quick search throws up ACL configuration, bugs, client connection limits as being some possible causes.

leventemol commented 11 months ago

In the broker log there is no received Publish shown for msgids 15 and 17 sent with duplicate flag value 0. There is denied Publish for msgids 15 and 17 sent with duplicate flag value 1.

It seems that the initial Publish for msgids 15 and 17 fails, when MQTTProtocol_startPublish() returns socket error in MQTTAsyncUtils::MQTTAsync_processCommand(). Sending the outbound publish messages (with duplicate flag value 1) for which the initial Publish failed, MQTTPacket_send_publish() returns socket error in MQTTProtocolClient::MQTTProtocol_retries(). This results a MQTTProtocol_closeSession() call, and the connection does not complete.

icraggs commented 11 months ago

The broker shouldn't deny a publish just because the dup flag is set. I don't think it should be denying any of these publishes. The reason for denying the publishes could be the source of the problem.

Getting a socket error on send often means the broker at the other end has closed the connection.

leventemol commented 11 months ago

The published message content is different for msgid 15 in MQTTPacket_send_publish() with duplicate flag set, from when the initial Publish failed. See publisher_log.txt entries MQTTProtocol_startPublish and MQTTPacket_send_publish, in publisher_log_extract.txt. The issue may be at the Publish object deallocation after MQTTProtocol_startPublish() when this returns socket error, marked with the comment in MQTTAsyncUtils::MQTTAsync_processCommand().

publisher_log.txt publisher_log_extract.txt

leventemol commented 11 months ago

In MQTTAsyncUtils::MQTTAsync_processCommand(), have tried to set the Publish object's payload and topic members with MQTTStrdup(), properties member with MQTTProperties_copy(), and did not deallocate the Publish object if MQTTProtocol_startPublish() returns socket error. In this case the published message content for msgid 15 in MQTTPacket_send_publish() with duplicate flag set, is the same as when the initial Publish failed. When the client reconnects, all outbound publish messages are sent out and the connection is completed. See the attached log files.

publisher_log.txt publisher_log_extract.txt

icraggs commented 10 months ago

I've tried this scenario, on Linux, on 1.3.12 and 1.3.13 and it worked up to now. Trace extract:

20231030 145843.445 3 Issue1401 -> PUBLISH msgid: 96 qos: 2 retained: 0 rc 0 payload len(10): Issue1401
dup flag is 0
20231030 145843.479 3 Issue1401 -> DISCONNECT (0)
...
20231030 145849.490 3 Issue1401 -> CONNECT version 4 clean: 0 (0)
20231030 145849.511 3 Issue1401 <- CONNACK rc: 0
20231030 145849.511 3 Issue1401 -> PUBLISH msgid: 96 qos: 2 retained: 0 rc 0 payload len(10): Issue1401
dup flag is 1
20231030 145849.527 3 Issue1401 <- PUBREC msgid: 96
20231030 145849.527 3 Issue1401 -> PUBREL msgid: 96 (0)

This is the test program I used:

Issue1401.c.txt

leventemol commented 6 months ago

Thank you for the information. We are using Python in our test environment, on Windows with Paho library version 1.3.12, and the connect/disconnect functionality is obtained using firewall rules for blocking remote ports at the MQTT Broker. We will try to use the same test program in our test environment, and will follow up with more information.