espressif / esp-mqtt

ESP32 mqtt component
Apache License 2.0
603 stars 255 forks source link

Bug: esp_mqtt_client_publish deadlock (IDFGH-9038) #244

Closed liliumjsn closed 1 year ago

liliumjsn commented 1 year ago

Description

There is a situation where the mqtt broker connection is not steady and the esp_mqtt_client_publish gets a deadlock at MQTT_API_LOCK(client).

In this scenario, the esp_mqtt_client_publish is called, then the MQTT_API_LOCK(client) and then the esp_mqtt_abort_connection is called here:

https://github.com/espressif/esp-mqtt/blob/dffabb067fb3c39f486033d2e47eb4b1416f0c82/mqtt_client.c#L2026

which also tries to lock the API and waits there forever.

Reproduction

This is reproduced by letting the client connect to the mqtt broker, then switch off the internet connection and start calling the esp_mqtt_client_publish. The mqqt client state is not yet refreshed so there is nothing to check before calling the publish function. So after getting mqtt_write_data(client) != ESP_OK the esp_mqtt_abort_connection is called and the deadlock is created

euripedesrocha commented 1 year ago

Hi @liliumjsn, thanks for the bug report. The locking API uses a Recursive Mutex, and it's taken by the owner task and for each of Take there is a related Given. Can you share some logs of the reported bug? Could you also share idf version?

liliumjsn commented 1 year ago

The issue is that when the internet connection (not WiFi) is unstable, there are certain cases where the publish function won't return. I thought that this was due to a deadlock caused by esp_mqtt_abort_connection but you response about the recursive mutex wipes that out. I will try to catch the exact cause of the publish not returning and update the issue. We are on IDF 4.2.1.

euripedesrocha commented 1 year ago

Hi @liliumjsn, I tried to reproduce it but wasn't able to. What happens when you have an active Wi-Fi, without internet connection, is that it takes longer to detect that there's no connection available and abort with the MQTT client switching to the disconnected state. This behavior is from the network stack, and you can explore LWIP configurations to find the best option for your case. You may try:

TCP high speed retransmission - LWIP_TCP_HIGH_SPEED_RETRANSMISSION LWIP_TCP_MAXRTX - Maximum retransmission of data segments LWIP_TCP_RTO_TIME - Default TCP rto time

david-cermak commented 1 year ago

We will mention the potential delay before the connection is assumed closed in docs.