espressif / esp-mqtt

ESP32 mqtt component
Apache License 2.0
611 stars 257 forks source link

MQTT5 "flow control" gets out of sync, prevents message sending (IDFGH-9028) #243

Closed egnor closed 1 year ago

egnor commented 1 year ago

Note, I can spin up a PR to implement my recommended fix (see below) if that's helpful.

Repro case

This is just one example case, there are others (see analysis below).

While the server is frozen, this kind of message shows up

E (567293) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (567299) mqtt_client: esp_mqtt_client_enqueue check fail
E (567305) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (567313) mqtt_client: esp_mqtt_client_enqueue check fail
D (567319) mqtt_client: No data received in 0ms (all is quiet)
D (567324) mqtt_client: Sending Duplicated QoS1 message with id=11933
D (567330) mqtt5_client: Sent (49) qos > 0 publish packet without ack

In the failure state, debug messages look like this

E (599892) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (599898) mqtt_client: esp_mqtt_client_enqueue check fail
E (599904) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (599912) mqtt_client: esp_mqtt_client_enqueue check fail
D (600129) mqtt_client: No data received in 0ms (all is quiet)

Analysis

As per the MQTT5 spec, the MQTT5 client implements flow control. When a QOS=1 or QOS=2 PUBLISH is sent, esp_mqtt5_flow_control() increments client->send_publish_packet_count: https://github.com/espressif/esp-mqtt/blob/dffabb067fb3c39f486033d2e47eb4b1416f0c82/mqtt5_client.c#L19

When a PUBACK is received, esp_mqtt5_parse_puback() decrements the counter: https://github.com/espressif/esp-mqtt/blob/dffabb067fb3c39f486033d2e47eb4b1416f0c82/mqtt5_client.c#L45

This counter is checked in `esp_mqtt5_client_publish_check(): https://github.com/espressif/esp-mqtt/blob/dffabb067fb3c39f486033d2e47eb4b1416f0c82/mqtt5_client.c#L294

This is a bit fragile because any corner case that fails to increment or decrement appropriately can lead to an out-of-sync, stuck counter that's permanently too high... so it needs to be carefully checked to ensure that ALL transmissions of PUBLISH and receptions of PUBACK are logged correctly. In this case, I see at least three defects:

  1. esp_mqtt5_parse_puback() is only called from mqtt_process_receive() if is_valid_mqtt_msg() is true for the message-ID in question. However, the counter should be decremented for any PUBACK, even if, say, the original message has been deleted due to expiration
  2. esp_mqtt5_client_publish_check() isn't called for DUP retransmissions
  3. The counter isn't reset to zero on reconnection (as far as I can tell)

Recommendation?

Split out the increment/decrement logic from other code, since it needs to be called very specifically:

ESP-YJM commented 1 year ago

@egnor Thanks for reporting it. I will fixed it.

ESP-YJM commented 1 year ago

@egnor Could you please try the pacth whether can solve the issue you met. 0001-mqtt5-Fix-flow-control-will-regard-the-DUP-packet-an.patch

egnor commented 1 year ago

@ESP-YJM Yes, that seems to work, at least in my testing!