espressif / esp-mqtt

ESP32 mqtt component
Apache License 2.0
608 stars 257 forks source link

Connection aborts when mqtt_resend_queued fails (IDFGH-12830) #277

Open Totrasmek opened 5 months ago

Totrasmek commented 5 months ago

Description

On transmission of data enqueued for sending in the outbox in mqtt_resend_queued, a failure of esp_mqtt_write will cause an MQTT disconnection. This can interrupt a devices uptime significantly, especially when network conditions are poor. This is because any network delays greater than the network timeout will cause MQTT to disconnect.

Code

mqtt_client.c line 1505

    // try to resend the data
    if (esp_mqtt_write(client) != ESP_OK) {
        ESP_LOGE(TAG, "Error to resend data ");
        esp_mqtt_abort_connection(client);
        return ESP_FAIL;
    }

Suggested fix

In order to avoid unnecessary disconnects, the conditional check could instead be (esp_mqtt_write(client) != ESP_OK) && (esp_mqtt_write(client) != ESP_ERR_TIMEOUT).

euripedesrocha commented 5 months ago

Hi @Totrasmek thanks for reporting. Could you elaborate on your testing scenarios? The timeout errors in our writing path point that the system has some error internal to the stack, and postponing the reconnection will only postpone the fix for the internal state.

Totrasmek commented 5 months ago

Hey @euripedesrocha , in my workspace we have some very poor networks. However, on a good network I recreated the dropouts by adding simulated network delays with sudo tc qdisc add dev <interface> root netem delay <delay_in_ms>ms, where <delay_in_ms> is greater than network_timeout_ms.

What you say about internal errors makes sense, but I would have thought that this code on line 708 of mqtt-client.c would catch internal errors:

if (wlen < 0) {
    ESP_LOGE(TAG, "Writing failed: errno=%d", errno);
    esp_mqtt_client_dispatch_transport_error(client);
    return ESP_FAIL;
}

And that this code from line 714 of mqtt-client.c would catch timeouts like I experience:

if (wlen == 0) {
    ESP_LOGE(TAG, "Writing didn't complete in specified timeout: errno=%d", errno);
    return ESP_ERR_TIMEOUT;
}

However, if wlen == 0 happens due to internal state errors, then this is definitely more complicated.

euripedesrocha commented 5 months ago

Could you share which target are you using? Also, mqtt client configuration and idf version.

Totrasmek commented 5 months ago

We are using v4.4.6 but aim to move to latest v5.2.1 soon. I'll see if I can get a v5.2.1 test setup so that I can confirm whether or not the issue exists in the latest version. I assumed as the code was quite similar that behaviour would be too, but I might be wrong.

Target: ESP32 Version: 4.4.6 MQTT Client Config:

        esp_mqtt_client_config_t mqtt_cfg = {
            .keepalive                  = 15,
            .task_prio                  = 1,
            .task_stack                 = 30000,
            .buffer_size                = 30000,
            .reconnect_timeout_ms       = 5000,
            .network_timeout_ms         = 1000,
            .message_retransmit_timeout = 0,
        };

SDKConfig:

#
# ESP-MQTT Configurations
#
CONFIG_MQTT_PROTOCOL_311=y
CONFIG_MQTT_TRANSPORT_SSL=y
CONFIG_MQTT_TRANSPORT_WEBSOCKET=y
CONFIG_MQTT_TRANSPORT_WEBSOCKET_SECURE=y
# CONFIG_MQTT_MSG_ID_INCREMENTAL is not set
CONFIG_MQTT_SKIP_PUBLISH_IF_DISCONNECTED=y
# CONFIG_MQTT_REPORT_DELETED_MESSAGES is not set
CONFIG_MQTT_USE_CUSTOM_CONFIG=y
CONFIG_MQTT_TCP_DEFAULT_PORT=1883
CONFIG_MQTT_SSL_DEFAULT_PORT=8883
CONFIG_MQTT_WS_DEFAULT_PORT=80
CONFIG_MQTT_WSS_DEFAULT_PORT=443
CONFIG_MQTT_BUFFER_SIZE=30000
CONFIG_MQTT_TASK_STACK_SIZE=30000
# CONFIG_MQTT_DISABLE_API_LOCKS is not set
CONFIG_MQTT_TASK_PRIORITY=1
CONFIG_MQTT_TASK_CORE_SELECTION_ENABLED=y
# CONFIG_MQTT_USE_CORE_0 is not set
CONFIG_MQTT_USE_CORE_1=y
# CONFIG_MQTT_CUSTOM_OUTBOX is not set
CONFIG_MQTT_OUTBOX_EXPIRED_TIMEOUT_MS=50000
# end of ESP-MQTT Configurations
euripedesrocha commented 5 months ago

@Totrasmek you are correct that the code is quite similar and also is the behavior, but there are some changes between versions, although I don't believe they would impact in this case.

Could you also explain your test scenario? I'm not sure I understood how you are emulating the poor network connection.

What I meant by internal error was that to have timeouts in the writing path would likely mean that we have our internal buffers full and will not be able to write.

Could you set "transport_base" to debug mode and share the logs? You will be able to identify the scenario by the presence of the message: poll_write: select - Timeout before any socket was ready!

Totrasmek commented 5 months ago

Hey @euripedesrocha. I'll explain my test scenario a bit better. I connect an ESP32 to a local wifi network. I connect a laptop running a mosquitto mqtt broker to the same network. I use esp-mqtt on the ESP32 to connect to the mosquitto broker. The ESP32 begins enqueing QOS0 messages to be sent to the broker. Under good network conditions, this works fine.

To simulate a bad network, I identify the network interface that the laptop is using to connect to the local wifi network by using ifconfig. I use that with the tc command to add latency to all network packets, or a random % of packets lost: sudo tc qdisc add dev <interface> root netem loss <percentage loss> I set to 10%. This gives the chance for the network to connect, send some packets, then for my network to connect, send some packets, then a transaction doesn't complete in time, and mqtt_resend_queued to call esp_mqtt_abort_connection(client) again due to a network timeout.

I've just now tried using esp_log_level_set("TRANSPORT_BASE",ESP_LOG_DEBUG); but it doesn't seem to be giving me anymore logs than normal. Here is an example:


11:22:19.045 > E (54187) MQTT_CLIENT: Writing didn't complete in specified timeout: errno=119
11:22:19.046 > E (54188) mqtt: MQTT_EVENT_ERROR
11:22:19.046 > E (54188) mqtt: esp-tls error: 0x0
11:22:19.047 > E (54188) mqtt: tls-stack error: 0x0
11:22:19.047 > E (54189) mqtt: transport socket errno: 0x0, string: Success
11:22:19.047 > E (54189) mqtt: tls cert verify flags: 0
11:22:19.049 > E (54190) MQTT_CLIENT: Error to resend data 
11:22:19.049 > I (54192) mqtt: MQTT_EVENT_DISCONNECTED
11:22:19.049 > E (54192) MQTT_CLIENT: Resend failed```

I'll see if I can get a bit more time to investigate further. Looks like are our timezones aren't lining up great, but I really do appreciate the support so far.
euripedesrocha commented 5 months ago

Hi @Totrasmek thanks for the added details. Regarding the logs, maybe it was introduced after 5.0 release, I would need to verify.
I will put some thoughts on the problem and see if I can find some improvement in the default scenario. A work around you can adopt is to adjust the timeout of the network paths to your case. I must say that any changes applied here will not be backported to 4.x branches and that to have access to the changes you need to update to more recent versions.

movivint commented 4 months ago

@Totrasmek Did you ever figure out a solution or workaround for this issue?

I am experiencing what appears to be the same error with my MQTT client that is running esp-idf 4.4.6 on an ESP32-S3.

euripedesrocha commented 4 months ago

@movivint could you please open anew issue with detailed description of the problem you are facing? I must point out that 4.4 branch reached EOL and changes will likely not be backported.