Open lumapu opened 1 month ago
A low-memory issue, I suspect. Perhaps drastically increase the minimum free memory requirement (https://github.com/bertmelis/espMqttClient/blob/main/src/Config.h#L32) or use the memory pool feature (https://github.com/bertmelis/espMqttClient/blob/main/src/Config.h#L65).
Is your application able to handle a failed publish (publish returns zero)?
PS: Arduino is currently at v3.0.5, so I don't know what you mean by 6.0.7.
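A minimal sketch of what handling a failed publish could look like, assuming only that publish returns zero on failure as mentioned above. `PublishFn` and `RetryingPublisher` are hypothetical names and not part of espMqttClient; the callable stands in for the real client call so the sketch also compiles off-device. The retry buffer is bounded on purpose, since an unbounded one would make a low-memory situation worse.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the client's publish call: returns a packet id,
// or 0 when the message could not be accepted (assumption taken from the thread).
using PublishFn =
    std::function<uint16_t(const std::string& topic, const std::string& payload)>;

// Hypothetical helper, not part of espMqttClient: parks messages whose
// publish attempt returned 0 and retries them later.
class RetryingPublisher {
 public:
  RetryingPublisher(PublishFn publish, std::size_t maxPending)
      : _publish(std::move(publish)), _maxPending(maxPending) {}

  // Returns true if the message was accepted by the client or parked for retry.
  // Returns false if the retry buffer is full and the message was dropped.
  bool send(const std::string& topic, const std::string& payload) {
    if (_publish(topic, payload) != 0) return true;
    if (_pending.size() >= _maxPending) return false;
    _pending.emplace_back(topic, payload);
    return true;
  }

  // Call from the main loop: retries parked messages in order and stops at
  // the first failure so the order of messages is preserved.
  void retryPending() {
    while (!_pending.empty()) {
      const auto& msg = _pending.front();
      if (_publish(msg.first, msg.second) == 0) return;
      _pending.pop_front();
    }
  }

  std::size_t pending() const { return _pending.size(); }

 private:
  PublishFn _publish;
  std::size_t _maxPending;
  std::deque<std::pair<std::string, std::string>> _pending;
};
```

On the device, the callable would wrap the real client's publish call, and `retryPending()` would be called once per main-loop iteration.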
Still the same problem as https://github.com/bertmelis/espMqttClient/discussions/164.
Three hypotheses:
You're right, same issue as in #164. Will close this one. Thank you for your quick response. This time the ESP crashed really fast after booting, so I can't imagine that the ESP ran out of memory. It feels more as if the memory was somehow already freed before the data was sent out.
Keep this open. The other one is converted to a discussion.
It is most strange. The library allocates the entire packet on heap memory and only releases the memory after it is completely sent. Sending in this case means passed to Arduino's `WiFiClient::send`. In my understanding, this creates a second copy of this data for the underlying lwip. I don't think it is a memory issue although it appears as one.
I'm searching for a concurrency/deadlock issue.
Are you using builtin WiFi or an ethernet adapter (w5500)?
Other observations:
- If you use PlatformIO, you might want to consider upgrading to https://github.com/pioarduino/platform-espressif32
- AsyncTCP and non-async code are mixed in your code. Is it possible (as a test) to disable the features that use AsyncTCP?
> Are you using builtin WiFi or an ethernet adapter (w5500)?

In this scenario here: yes. For the issue in the discussion I don't remember, but I can try to figure it out.

> AsyncTCP

I'll give it a try. I never thought in this direction.
This is going to be trial and error bughunting.
Or somebody needs to have a divine intervention.
> Is it possible to (as a test) disable the features that use AsyncTCP?

All MQTT and AsyncWebserver stuff is based on AsyncTCP. I don't know if it really makes sense to disable it; the system is then not "useful" anymore. The issue does not occur on my system. It happens to one of the users, at seemingly random intervals of several days.
To rule out memory exhaustion issues you could try with the memory pool enabled. The library then allocates its memory statically on initialization (underlying libraries not taken into account).
If you need some guidance with that, let me know.
Today I have the same issue again. Because of that I read a bit about `xSemaphoreTake` in FreeRTOS. The documentation says that it is necessary to check the return value even if you set `portMAX_DELAY` as the timeout.
I feel that the issue may be related to that, because `MqttClient::loop()` as well as `MqttClient::publish` are guarded by a semaphore.
Do you think it makes sense to check, in the mentioned functions of `MqttClient`, that the return value is `pdTRUE`?
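To make the question concrete, here is a minimal desktop sketch of "only enter the guarded section if the take really succeeded". It uses `std::timed_mutex` as a stand-in for the FreeRTOS mutex so that it runs off-device; `clientMutex` and `withClientLock` are hypothetical names and do not exist in the library. On the ESP32 the equivalent check would compare the result of `xSemaphoreTake` against `pdTRUE`.

```cpp
#include <chrono>
#include <mutex>

// Desktop stand-in for the FreeRTOS mutex guarding the client (assumption).
static std::timed_mutex clientMutex;

// Hypothetical helper: runs `work` only if the lock was actually obtained.
// Returns false without touching shared state when the take failed.
template <typename Work>
bool withClientLock(std::chrono::milliseconds timeout, Work work) {
  if (!clientMutex.try_lock_for(timeout)) {
    return false;  // analogous to xSemaphoreTake(...) not returning pdTRUE
  }
  // Adopt the already held lock so it is released on every exit path.
  std::lock_guard<std::timed_mutex> guard(clientMutex, std::adopt_lock);
  work();
  return true;
}
```

The caller can then decide what a failed take means, for example skipping the loop iteration or reporting a failed publish, instead of running the guarded code without holding the lock.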
I also read an older article (2006) where the solution was to call `yield()` after `xSemaphoreGive()` to force a context switch.
I don't know too much about that, but on the other hand I think these extra checks will improve things.
In the meantime I'll try to patch the library locally and test it for a few days. Hopefully it helps to cover this problem.
quick'n'dirty:
Patch which also is compatible with ESP8266
Regarding the return value of `xSemaphoreTake`: could you provide a link to the explanation? Not that I'm not willing to take it into account. I'm also here to learn.
Every iteration of `loop` has a yield. Yielding after API methods like `publish` is up to the user.
Another possibility would be to not use blocking semaphores in the library `loop`. After all, the operations can be executed in the next iteration whereas publishing might not be able to block.
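The idea of a non-blocking loop can be sketched as follows. This is a desktop analog using `std::mutex`, not the library's implementation; `Outbox`, `enqueue` and `loopOnce` are hypothetical names. On FreeRTOS the non-blocking take would correspond to calling `xSemaphoreTake` with a timeout of 0 and checking the result. The sketch assumes that the publishing side keeps a blocking lock.

```cpp
#include <mutex>
#include <queue>
#include <string>

// Hypothetical outbox shared between a publishing caller and the client loop.
struct Outbox {
  std::mutex lock;
  std::queue<std::string> packets;
  int sent = 0;

  // Publishing side: waits for the lock (assumption: blocking is kept here).
  void enqueue(const std::string& packet) {
    std::lock_guard<std::mutex> guard(lock);
    packets.push(packet);
  }

  // Loop side: never waits. If the lock is busy, nothing is done and the
  // work is simply picked up in the next iteration.
  bool loopOnce() {
    std::unique_lock<std::mutex> guard(lock, std::try_to_lock);
    if (!guard.owns_lock()) return false;
    if (!packets.empty()) {
      packets.pop();  // stands in for handing one packet to the transport
      ++sent;
    }
    return true;
  }
};
```

Because `loopOnce()` returns immediately when the lock is held elsewhere, the loop task cannot deadlock on this lock; the cost is that a packet may be sent one iteration later.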
Sure, here are the links I visited yesterday:
https://www.freertos.org/FreeRTOS_Support_Forum_Archive/February_2006/freertos_xSemaphoreTake_fails_before_timeout_1441441.html
https://www.freertos.org/Documentation/02-Kernel/04-API-references/10-Semaphore-and-Mutexes/12-xSemaphoreTake
I also asked ChatGPT (in German); this was the important output:
What does 0xFFFFFFFF or portMAX_DELAY mean?
portMAX_DELAY (0xFFFFFFFF): If you use portMAX_DELAY as the timeout, the task waits indefinitely until it can take the mutex. The task therefore blocks until the mutex becomes free, and only then enters the critical section.
Why is the check still necessary?
Even with an infinite timeout (portMAX_DELAY), xSemaphoreTake() can be unsuccessful under certain circumstances. For example:
- Errors in the semaphore initialization: if the mutex or the semaphore itself was not initialized correctly, xSemaphoreTake() could fail.
- System interrupts or exceptions: there are scenarios in which a task is prevented from taking the mutex by system interrupts, memory problems or other system errors, even though it theoretically waits indefinitely. In this case xSemaphoreTake() would also return pdFALSE.
- Priority inversion or deadlocks: even if the task waits indefinitely, complex systems can run into deadlocks or priority inversion that prevent the mutex from being taken successfully.
Question: do you use tasks other than the Arduino task itself in your application? Which of the tasks use MQTT?
You might want to disable the separate MQTT task and just call `loop()` from your code so you will have less to worry about concurrency.
Describe the bug: ESP crashed, coredump was read from the device
Which platform, esp8266 or esp32? ESP32-S3
Do you use TLS or not? No TLS
Do you use an IDE (Arduino, Platformio...)? Platformio
Which version of the Arduino framework? 6.7.0
Please include any debug output and/or decoded stack trace if applicable.
Stack trace
```
===============================================================
==================== ESP32 CORE DUMP START ====================

Crashed task handle: 0x3fcf6990, name: '', GDB name: 'process 1070557584'

================== CURRENT THREAD REGISTERS ===================
exccause 0x1d (StoreProhibitedCause)
excvaddr 0x0
epc1 0x42079209
epc2 0x0
epc3 0x0
epc4 0x0
epc5 0x0
epc6 0x0
eps2 0x0
eps3 0x0
eps4 0x0
eps5 0x0
eps6 0x0
[New process 1070557584]
[New process 1070558984]
[New process 1070544264]
[New process 1070525568]
[New process 1070313812]
[New process 1070536528]
[New process 1070551560]
[New process 1070556032]
[New process 1070345904]
[New process 1070273552]
[New process 1070534828]
[New process 1070514224]
[New process 1070299724]
[New process 1070342604]
[New process 1070550388]
[Current thread is 1 (process 1070557584)]

==================== CURRENT THREAD STACK =====================

======================== THREADS INFO =========================
pc 0x40377da5 0x40377da5
```
Expected behaviour: no crash
To Reproduce: not that easy, I don't know how to do it.
Additional context: Can you determine where to search? Does it happen in the MQTT library or in my code? To me it feels as if the issue happens while the library publishes the internal queue.