Open Sky-Soo-Ap opened 1 month ago
Hello, Let me add some information. When verbose-level log is turned on, there are no more logs, only the same three lines of logs. W (15:45:45.560) transport_base: Poll timeout or error, errno=Success, fd=48, timeout_ms=10000 E (15:45:45.560) mqtt_client: Writing didn't complete in specified timeout: errno=0 E (15:45:45.569) azp_mqtt: MQTT_EVENT_DISCONNECTED
Anyway, attach the verbose-level log. 2024-10-21_Verbose_Level.log
According to the verbose level log, when stuck, BLE still seems to be receiving data normally. Is it because MQTT has not reconnected or is it stuck on Wi-Fi? I have done Wi-Fi reconnection in the app, and if the Wi-Fi disconnected. MQTT will automatically reconnect when disconnected, right?
This problem is very serious and requires a manual reboot to return to normal.
Hi @Sky-Soo-Ap, the MQTT client will try to reconnect, unless you set it not to. To help me investigate the issue, could you share the configuration you have for the mqtt client? It looks like you don't have the mqtt client in verbose level log, I'm missing a few messages that would appear in debug level.
Hi Euripedesrocha, I’m attaching mqtt client settings code and sdkconfig files. MQTT_settings.txt sdkconfig.txt
How to set mqtt client to verbose level log? Is it using the API esp_log_level_set("*", ESP_LOG_VERBOSE) to set it?
Hi, about the logs from the code sent, you should use lower case tags to set the logs.
Also, that memcpy
for the username and password are unnecessary, just pass the pointers and the client will copy. At first glance, it also looks incorrect. Aren't you writing over the struct data?
Hi, about the logs from the code sent, migrated from old esp-idf v5.0.x. I will change it to lower case as following and trying to get the log. esp_log_level_set("*", ESP_LOG_INFO); // Is this line needed to change to ESP_LOG_VERBOSE? esp_log_level_set("mqtt_client", ESP_LOG_VERBOSE); esp_log_level_set("mqtt_example", ESP_LOG_VERBOSE); esp_log_level_set("transport_base", ESP_LOG_VERBOSE); esp_log_level_set("esp-tls", ESP_LOG_VERBOSE); esp_log_level_set("transport", ESP_LOG_VERBOSE); esp_log_level_set("outbox", ESP_LOG_VERBOSE);
Aren't you writing over the struct data? Sorry, I didn't catch what you were trying to say. Could you point it out? Thanks
About the logs: The first line set it to all the tags, the following esp_log_level_set
define the level for specific components of interest. The tags are lower case from IDF > 5.0
About the memcpy
, you If you are not allocating the necessary space that pointer is likely pointing to invalid memory. That copy is unnecessary anyway. The client do the copy already.
I turned on the verbose log level of "mqtt_client" and did not see any more logs. Anyway, the attachment is a recaptured log. 2024-10-23__mqtt_client_verbose_log_level.log
Hi euripedesrocha, Following are the reproducible steps: 1) ESP32 connects to phone Wi-Fi AP; 2) ESP32 connects to an online MQTT broker; 3) ESP32 keeps receiving notification data and publishes it to the broker; 4) Disable the phone 4G/5G network;
Hi euripedesrocha, Is there any progress? If this issue is not solved, the product cannot be shipped.
@Sky-Soo-Ap sorry for the delay, this seems to be unlikely an issue on the MQTT client, but I'm having a hard time identifying the root cause. To be able to help you, I'll need a code reproducer to evaluate the scenario. From the configuration code you sent, you should see the reconnection attempts from the mqtt client.
Hi euripedesrocha, Finally found the root cause, it was stuck in API vTaskSuspend. Sorry, I don't have detailed debugging here. The attachment is a reproducible program. tcp.zip
Following are the reproducible steps: ESP32 connects to phone Wi-Fi AP; ESP32 connects to an online MQTT broker; ESP32 keeps receiving notification data and publishes it to the broker; Disable the phone 4G/5G network;
I see. Have you considered using MQTT_SKIP_PUBLISH_IF_DISCONNECTED
?
Hi euripedesrocha, The actual situation is this: sometimes Wi-Fi, or MQTT is disconnected and reconnected; To prevent packet loss, we store the notification data packet in a large queue and pause it by calling the API vTaskSuspend. Because QoS 0 is used to publish, mqtt client disconnected will cause packet loss. Why does calling API vTaskSuspend in an MQTT disconnected event get stuck?
@Sky-Soo-Ap from reading your code you sent, the only thing I see is the notify_task
not being available and the task handler being NULL at that point. This would suspend the mqtt task.
I have a few suggestions.
You can use the App trace library to debug the situation and find the issue. We'll spend some time on it also.
For the problem you are trying to solve, I'll suggest using esp_mqtt_client_enqueue
instead of publish to handle this. Even QoS0 messages will be stored in the outbox queue and you can keep pushing to it, limited of course by the available memory.
Hi @Sky-Soo-Ap I just tried the reproducer sent and wasn't able to reproduce the issue. Is there something missing in the code?
Hi euripedesrocha, The reproducer is able to reproduce this issue; following is the log file. Reproducer.log And, I checked that the task handler is not NULL after creation, then called “configASSERT(xNotifyTaskHandle);”.
void suspendNotifyTask() { if (xNotifyTaskHandle != NULL) { vTaskSuspend(xNotifyTaskHandle); ESP_LOGW(TAG, "Suspended publish notification packets"); } }
How does the API esp_mqtt_client_enqueue behave when the outbox is full? It will drop the current data or the first queued data?
@Sky-Soo-Ap I wasn't able to reproduce on my side using the code you sent. There were other parts of your system that you omitted in this reproducer that might affect the behavior?
When you have outbox memory exhausted, either by system without enough memory or outbox memory usage limit, you get an error, no messages are dropped from the outbox.
@euripedesrocha Could you try the bin file I make? Did you use IDF v5.3.1? mqtt_tcp.zip
no messages are dropped from the outbox.
So, the get an error data will drop, right?
@Sky-Soo-Ap I retried and noticed that I had one step incorrect and was able to reproduce it, thanks for the help.
Not exactly. The message is not created, your data isn't touched, so no data is lost/drop. You'll not be able to publish/enqueue at that point, but that is because your system memory is exhausted.
@Sky-Soo-Ap we were able to identify the source of your problem. The disconnect event is dispatched from the notify task context, at this point publish is running and holding the mutex of the mqtt client task. Since you suspend the notify task, the mutex is never released and the mqtt client isn't able to run and notify task never runs gain as well.
@euripedesrocha This is strange, because I intuitively think that the mqtt event handler is handled by the mqtt client task; the API esp_mqtt_client_publish does not explain it, so no one will know. Furthermore, the disconnected event will not be deemed relevant, while the published event will have a slight relevance. Will all mqtt events be handled by mqtt client task?
@Sky-Soo-Ap you are partially correct. The event is dispatched from the task context that is running at the moment, TBH it took me a while to figure out the deadlock here as well, I was considering the mqtt task had acquired the mutex.
I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?
Events will most of the time be dispatched from the mqtt task, I'll check other points where this could happen and improve the documentation about it.
Have you experimented with enqueue function? Does it solve your original problem?
@euripedesrocha
I didn't follow your thoughts on relevance of the disconnected/published events. Could you elaborate?
What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task. If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.
Have you experimented with enqueue function? Does it solve your original problem?
The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.
@Sky-Soo-Ap
What I mean is that the API and event tasks must be disconnected; in this example, the upper layer calls the API, returns an error after time out, and the event generated by time out is processed by the mqtt task. If the API and event tasks are not disconnected, the upper layer calls the publish API, receives the published event, and then calls the publish API, which will form a loop, and the stack size will be full soon.
The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.
The queuing function does not meet the functional requirements. What I need is that when the queue is full, the oldest data is discarded and the latest is saved.
Could still be a solution. The outbox behavior can be fully customized.
@euripedesrocha
The published event is dispatched only by the mqtt task. I still need to double-check, but I believe this behavior is only for the disconnection event.
Will you solve this problem? Disconnect event dispatched by the mqtt task.
Could still be a solution. The outbox behavior can be fully customized.
I didn't find the relevant APIs or structural data to customize it.
@Sky-Soo-Ap
Will you solve this problem? Disconnect event dispatched by the mqtt task.
This behavior isn't likely to change at the moment. I'll work on it later though.
I didn't find the relevant APIs or structural data to customize it.
You can find the relevant example here
Answers checklist.
IDF version.
v5.3.1
Espressif SoC revision.
ESP32-D0WDR2-V3-V3 (revision v3.1)
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
None
Development Kit.
Custom Board
Power Supply used.
USB
What is the expected behavior?
I expected Wi-Fi or MQTT disconnected would not be stuck here. It will keep re-connecting Wi-Fi or MQTT;
What is the actual behavior?
Using Wi-Fi to publish MQTT data, transport timeout, then stuck.
Steps to reproduce.
Debug Logs.
More Information.
No response