espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.84k stars 7.32k forks source link

Using Wi-Fi to publish MQTT data, transport timeout, then stuck. (IDFGH-13901) #14743

Open Sky-Soo-Ap opened 1 month ago

Sky-Soo-Ap commented 1 month ago

Answers checklist.

IDF version.

v5.3.1

Espressif SoC revision.

ESP32-D0WDR2-V3-V3 (revision v3.1)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

Custom Board

Power Supply used.

USB

What is the expected behavior?

I expected Wi-Fi or MQTT disconnected would not be stuck here. It will keep re-connecting Wi-Fi or MQTT;

What is the actual behavior?

Using Wi-Fi to publish MQTT data, transport timeout, then stuck.

Steps to reproduce.

  1. Wi-Fi & BLE coexist;
  2. ESP32 keep receiving BLE notificaiton data;
  3. Using ESP32 Wi-Fi publish notification data to MQTT; ...

Debug Logs.

I (11:44:10.037) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
I (11:44:10.248) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
I (11:44:10.428) azp_mqtt: iometer/agent/devices/notification/bb:a0:50:78:97:da, drop packet count:0
W (11:44:20.670) transport_base: Poll timeout or error, errno=Success, fd=48, timeout_ms=10000
E (11:44:20.672) mqtt_client: Writing didn't complete in specified timeout: errno=0
E (11:44:20.680) azp_mqtt: MQTT_EVENT_DISCONNECTED

More Information.

No response

Sky-Soo-Ap commented 1 month ago

Hello, Let me add some information. When verbose-level log is turned on, there are no more logs, only the same three lines of logs. W (15:45:45.560) transport_base: Poll timeout or error, errno=Success, fd=48, timeout_ms=10000 E (15:45:45.560) mqtt_client: Writing didn't complete in specified timeout: errno=0 E (15:45:45.569) azp_mqtt: MQTT_EVENT_DISCONNECTED

Anyway, attach the verbose-level log. 2024-10-21_Verbose_Level.log

According to the verbose level log, when stuck, BLE still seems to be receiving data normally. Is it because MQTT has not reconnected or is it stuck on Wi-Fi? I have done Wi-Fi reconnection in the app, and if the Wi-Fi disconnected. MQTT will automatically reconnect when disconnected, right?

This problem is very serious and requires a manual reboot to return to normal.

euripedesrocha commented 1 month ago

Hi @Sky-Soo-Ap, the MQTT client will try to reconnect, unless you set it not to. To help me investigate the issue, could you share the configuration you have for the mqtt client? It looks like you don't have the mqtt client in verbose level log, I'm missing a few messages that would appear in debug level.

Sky-Soo-Ap commented 1 month ago

Hi Euripedesrocha, I’m attaching mqtt client settings code and sdkconfig files. MQTT_settings.txt sdkconfig.txt

How to set mqtt client to verbose level log? Is it using the API esp_log_level_set("*", ESP_LOG_VERBOSE) to set it?

euripedesrocha commented 1 month ago

Hi, about the logs from the code sent, you should use lower case tags to set the logs. Also, that memcpy for the username and password are unnecessary, just pass the pointers and the client will copy. At first glance, it also looks incorrect. Aren't you writing over the struct data?

Sky-Soo-Ap commented 1 month ago

Hi, about the logs from the code sent, migrated from old esp-idf v5.0.x. I will change it to lower case as following and trying to get the log. esp_log_level_set("*", ESP_LOG_INFO); // Is this line needed to change to ESP_LOG_VERBOSE?  esp_log_level_set("mqtt_client", ESP_LOG_VERBOSE);  esp_log_level_set("mqtt_example", ESP_LOG_VERBOSE);  esp_log_level_set("transport_base", ESP_LOG_VERBOSE);  esp_log_level_set("esp-tls", ESP_LOG_VERBOSE);  esp_log_level_set("transport", ESP_LOG_VERBOSE);  esp_log_level_set("outbox", ESP_LOG_VERBOSE);

Aren't you writing over the struct data? Sorry, I didn't catch what you were trying to say. Could you point it out? Thanks

euripedesrocha commented 1 month ago

About the logs: The first line set it to all the tags, the following esp_log_level_set define the level for specific components of interest. The tags are lower case from IDF > 5.0 About the memcpy, you If you are not allocating the necessary space that pointer is likely pointing to invalid memory. That copy is unnecessary anyway. The client do the copy already.

Sky-Soo-Ap commented 1 month ago

I turned on the verbose log level of "mqtt_client" and did not see any more logs. Anyway, the attachment is a recaptured log. 2024-10-23__mqtt_client_verbose_log_level.log

Sky-Soo-Ap commented 4 weeks ago

Hi euripedesrocha, Following are the reproducible steps: 1) ESP32 connects to phone Wi-Fi AP; 2) ESP32 connects to an online MQTT broker; 3) ESP32 keeps receiving notification data and publishes it to the broker; 4) Disable the phone 4G/5G network;

Sky-Soo-Ap commented 3 weeks ago

Hi euripedesrocha, Is there any progress? If this issue is not solved, the product cannot be shipped.

euripedesrocha commented 3 weeks ago

@Sky-Soo-Ap sorry for the delay, this seems to be unlikely an issue on the MQTT client, but I'm having a hard time identifying the root cause. To be able to help you, I'll need a code reproducer to evaluate the scenario. From the configuration code you sent, you should see the reconnection attempts from the mqtt client.

Sky-Soo-Ap commented 3 weeks ago

Hi euripedesrocha, Finally found the root cause, it was stuck in API vTaskSuspend. Sorry, I don't have detailed debugging here. The attachment is a reproducible program. tcp.zip

Following are the reproducible steps: ESP32 connects to phone Wi-Fi AP; ESP32 connects to an online MQTT broker; ESP32 keeps receiving notification data and publishes it to the broker; Disable the phone 4G/5G network;

euripedesrocha commented 3 weeks ago

I see. Have you considered using MQTT_SKIP_PUBLISH_IF_DISCONNECTED?

Sky-Soo-Ap commented 3 weeks ago

Hi euripedesrocha, The actual situation is this: sometimes Wi-Fi, or MQTT is disconnected and reconnected; To prevent packet loss, we store the notification data packet in a large queue and pause it by calling the API vTaskSuspend. Because QoS 0 is used to publish, mqtt client disconnected will cause packet loss. Why does calling API vTaskSuspend in an MQTT disconnected event get stuck?

euripedesrocha commented 3 weeks ago

@Sky-Soo-Ap from reading your code you sent, the only thing I see is the notify_task not being available and the task handler being NULL at that point. This would suspend the mqtt task. I have a few suggestions.

euripedesrocha commented 3 weeks ago

Hi @Sky-Soo-Ap I just tried the reproducer sent and wasn't able to reproduce the issue. Is there something missing in the code?

Sky-Soo-Ap commented 2 weeks ago

Hi euripedesrocha, The reproducer is able to reproduce this issue; following is the log file. Reproducer.log And, I checked that the task handler is not NULL after creation, then called “configASSERT(xNotifyTaskHandle);”.

void suspendNotifyTask() { if (xNotifyTaskHandle != NULL) { vTaskSuspend(xNotifyTaskHandle); ESP_LOGW(TAG, "Suspended publish notification packets"); } }

How does the API esp_mqtt_client_enqueue behave when the outbox is full? It will drop the current data or the first queued data?

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap I wasn't able to reproduce on my side using the code you sent. There were other parts of your system that you omitted in this reproducer that might affect the behavior?

When you have outbox memory exhausted, either by system without enough memory or outbox memory usage limit, you get an error, no messages are dropped from the outbox.

Sky-Soo-Ap commented 2 weeks ago

@euripedesrocha Could you try the bin file I make? Did you use IDF v5.3.1? mqtt_tcp.zip

no messages are dropped from the outbox.

So, the get an error data will drop, right?

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap I retried and noticed that I had one step incorrect and was able to reproduce it, thanks for the help.

Not exactly. The message is not created, your data isn't touched, so no data is lost/drop. You'll not be able to publish/enqueue at that point, but that is because your system memory is exhausted.

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap we were able to identify the source of your problem. The disconnect event is dispatched from the notify task context, at this point publish is running and holding the mutex of the mqtt client task. Since you suspend the notify task, the mutex is never released and the mqtt client isn't able to run and notify task never runs gain as well.

Sky-Soo-Ap commented 2 weeks ago

@euripedesrocha This is strange, because I intuitively think that the mqtt event handler is handled by the mqtt client task; the API esp_mqtt_client_publish does not explain it, so no one will know. Furthermore, the disconnected event will not be deemed relevant, while the published event will have a slight relevance. Will all mqtt events be handled by mqtt client task?

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap you are partially correct. The event is dispatched from the task context that is running at the moment, TBH it took me a while to figure out the deadlock here as well, I was considering the mqtt task had acquired the mutex.

I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?

Events will most of the time be dispatched from the mqtt task, I'll check other points where this could happen and improve the documentation about it.

Have you experimented with enqueue function? Does it solve your original problem?

Sky-Soo-Ap commented 2 weeks ago

@euripedesrocha

I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?

What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task. If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.

Have you experimented with enqueue function? Does it solve your original problem?

The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap

What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task. If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.

The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.

The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.

Could still be a solution. The outbox behavior can be fully customized.

Sky-Soo-Ap commented 2 weeks ago

@euripedesrocha

The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.

Will you solve this problem? Disconnect event dispatched by the mqtt task.

Could still be a solution. The outbox behavior can be fully customized.

I didn't find the relevant APIs or structural data to customize it.

euripedesrocha commented 2 weeks ago

@Sky-Soo-Ap

Will you solve this problem? Disconnect event dispatched by the mqtt task.

This behavior isn't likely to change at the moment. I'll work on it later though.

I didn't find the relevant APIs or structural data to customize it.

You can find the relevant example here