Closed JonathanWitthoeft closed 2 years ago
Hi @JonathanWitthoeft
Could you tell us something more about the application, especially related to the mdns component? Do you init and deinit the mdns frequently?
I don't see a reason why or where the received pbuf could be double-freed. It's received by lwip's udp callback and the mdns parser is the only owner. I can only think of memory corruption of some kind.
Would it be possible to provide a simple example or a project that would exhibit this issue?
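To illustrate the assertion under discussion, here is a minimal host-side sketch of reference-counted buffer release. It is a simplified model, not lwIP's pbuf code: `model_pbuf`, `model_pbuf_alloc`, `model_pbuf_free`, and the result enum are hypothetical names, and unlike a real allocator the model never returns memory to the heap, so a second release can be inspected safely. It shows two ways the count can already be zero at release time: a second free by some other code path, or a stray write that clobbers the counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer: only the reference count is modelled. */
struct model_pbuf {
    unsigned ref;
};

/* Outcome of one release call. */
enum model_free_result {
    MODEL_STILL_REFERENCED, /* other owners remain */
    MODEL_RELEASED,         /* last reference dropped */
    MODEL_REF_UNDERFLOW     /* count was already 0: the case a "ref > 0" assertion guards */
};

/* Allocate a buffer with a single owner (ref == 1). Returns NULL on allocation failure. */
static struct model_pbuf *model_pbuf_alloc(void)
{
    struct model_pbuf *p = malloc(sizeof(*p));
    if (p != NULL) {
        p->ref = 1;
    }
    return p;
}

/* Drop one reference. The struct itself stays allocated so that a repeated
 * call is well defined in this model; the caller frees it with free(). */
static enum model_free_result model_pbuf_free(struct model_pbuf *p)
{
    assert(p != NULL);
    if (p->ref == 0) {
        return MODEL_REF_UNDERFLOW;
    }
    p->ref--;
    return (p->ref == 0) ? MODEL_RELEASED : MODEL_STILL_REFERENCED;
}
```

With a single owner, the first release reports `MODEL_RELEASED` and any later release reports `MODEL_REF_UNDERFLOW`. Overwriting `ref` with 0 before the first release gives the same result, so in this model a double free and a corrupted counter would trip the same check.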
The part I left out is that this is a certified HomeKit accessory. I imagine HomeKit is at a higher layer using the mDNS library, but it depends on DNS/mDNS for bonjour device discovery.
@david-cermak is there anything I can do to help debug this issue? Here is a different crash in pbuf_free that has a longer backtrace:
panic_abort at .../components/esp_system/panic.c:379
esp_system_abort at .../components/esp_system/system_api.c:112
abort at .../components/newlib/abort.c:46
__assert_func at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/assert.c:62 (discriminator 8)
pbuf_free at .../components/lwip/lwip/src/core/pbuf.c:757
(inlined by) pbuf_free at .../components/lwip/lwip/src/core/pbuf.c:729
nd6_send_na at .../components/lwip/lwip/src/core/ipv6/nd6.c:1340
nd6_input at .../components/lwip/lwip/src/core/ipv6/nd6.c:549
icmp6_input at .../components/lwip/lwip/src/core/ipv6/icmp6.c:121
ip6_input at .../components/lwip/lwip/src/core/ipv6/ip6.c:1090
ethernet_input at .../components/lwip/lwip/src/netif/ethernet.c:229
tcpip_thread_handle_msg at .../components/lwip/lwip/src/api/tcpip.c:180
(inlined by) tcpip_thread at .../components/lwip/lwip/src/api/tcpip.c:154
@JonathanWitthoeft As mentioned earlier, I don't have a better suggestion than trying to locate the memory corruption. Could you please follow these guidelines https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/heap_debug.html#heap-corruption-detection
Let's start with setting CONFIG_HEAP_CORRUPTION_DETECTION to Comprehensive and see if some new, more detailed info appears.
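As background on what this kind of detection looks for, the following is a rough host-side sketch of guard bytes placed around an allocation. It is an illustrative model only and does not reproduce ESP-IDF's heap implementation: `guarded_malloc`, `guarded_check`, `guarded_free`, and the guard pattern are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xA5u /* hypothetical pattern, chosen for this sketch only */
#define GUARD_LEN  8u    /* guard bytes on each side of the user region */
#define HEADER_LEN 16u   /* 8 bytes for the stored size + 8 head guard bytes */

/* Allocate `size` usable bytes laid out as [size][head guard][user data][tail guard]. */
static void *guarded_malloc(size_t size)
{
    if (size > SIZE_MAX - HEADER_LEN - GUARD_LEN) {
        return NULL; /* request too large to wrap with guards */
    }
    uint8_t *raw = malloc(HEADER_LEN + size + GUARD_LEN);
    if (raw == NULL) {
        return NULL;
    }
    uint64_t stored = size;
    memcpy(raw, &stored, sizeof(stored));
    memset(raw + 8, GUARD_BYTE, GUARD_LEN);
    memset(raw + HEADER_LEN + size, GUARD_BYTE, GUARD_LEN);
    return raw + HEADER_LEN;
}

/* Returns 1 if all GUARD_LEN bytes still hold the pattern, 0 otherwise. */
static int guard_intact(const uint8_t *guard)
{
    for (size_t i = 0; i < GUARD_LEN; i++) {
        if (guard[i] != GUARD_BYTE) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 when both guards are intact, 0 when a neighbouring write damaged one. */
static int guarded_check(const void *user)
{
    assert(user != NULL);
    const uint8_t *raw = (const uint8_t *)user - HEADER_LEN;
    uint64_t stored;
    memcpy(&stored, raw, sizeof(stored));
    return guard_intact(raw + 8) && guard_intact(raw + HEADER_LEN + (size_t)stored);
}

/* Release a block obtained from guarded_malloc. */
static void guarded_free(void *user)
{
    if (user != NULL) {
        free((uint8_t *)user - HEADER_LEN);
    }
}
```

Writing one byte past a 4-byte buffer damages the tail guard, so `guarded_check` returns 0 for that block. A check like this identifies the damaged block rather than the code that wrote to it, which is worth keeping in mind when reading the resulting backtraces.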
Unfortunately I am not able to replicate this issue, but our firmware is reporting the backtrace information to our cloud on panics, and there are many customer units with the same problem in the field. I plan on updating to the latest component changes and bug fixes and pushing an update to a customer that sees this crash several times daily. Hopefully it is a bug that has been fixed already. I will update this issue with my findings.
@JonathanWitthoeft I would still suggest updating the configuration. It would still crash (though I expect earlier, and perhaps with some additional clue about where the corruption comes from), so you'd eventually see the backtrace reported by your customers.
Hopefully it is a bug that has been fixed already.
I'm not aware of any recent fixes to related components (mdns, ethernet, lwip), but the crash might well disappear, only to reappear somewhere else...
Back to my original question:
Do you init and deinit the mdns frequently?
Does your application deinitialize some components (not only mdns), or does it simply set everything up and just run?
Does your application deinitialize some components (not only mdns), or does it simply set everything up and just run?
mdns and other components are set up and run. The only thing we ever de-init is blufi once provisioned. These crashes are happening about once a day well after provisioned, so this should not be the issue.
I have set CONFIG_HEAP_CORRUPTION_DETECTION to Comprehensive and waited a day and saw the 2 crashes below. One is the same old crash and the other references multi_heap_poisoning and has to do with wifi_malloc.
panic_abort at .../components/esp_system/panic.c:379
esp_system_abort at .../esp_system/system_api.c:112
abort at .../components/newlib/abort.c:46
__assert_func at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/assert.c:62 (discriminator 8)
pbuf_free at .../components/lwip/lwip/src/core/pbuf.c:757
(inlined by) pbuf_free at .../components/lwip/lwip/src/core/pbuf.c:729
_mdns_execute_action at .../components/mdns/mdns.c:4019
panic_abort at .../components/esp_system/panic.c:379
esp_system_abort at .../components/esp_system/system_api.c:112
abort at .../components/newlib/abort.c:46
lock_acquire_generic at .../components/newlib/locks.c:139
_lock_acquire_recursive at .../components/newlib/locks.c:167
__retarget_lock_acquire_recursive at .../components/newlib/locks.c:323
_vfiprintf_r at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/vfprintf.c:853 (discriminator 2)
fiprintf at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/fiprintf.c:48
__assert_func at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/assert.c:58 (discriminator 8)
multi_heap_malloc at .../components/heap/multi_heap_poisoning.c:237
(inlined by) multi_heap_malloc at .../components/heap/multi_heap_poisoning.c:219
heap_caps_malloc at .../components/heap/heap_caps.c:145
heap_caps_malloc_prefer at .../components/heap/heap_caps.c:226
wifi_malloc at .../components/esp_wifi/esp32/esp_adapter.c:77
ieee80211_timer_process at ??:?
pp_timer_process at ??:?
lmac_stop_hw_txq at ??:?
timer_process_alarm at .../components/esp_timer/src/esp_timer.c:330
(inlined by) timer_task at .../components/esp_timer/src/esp_timer.c:349
I have realized that these crashes seem to only happen on a subset of units that have our Local API enabled. For these units we add an mdns text record, once, after HomeKit is initiated, and we see the mDNS record on a network scan:
// Build the mDNS TXT record
mdns_txt_item_t txtData[4];
int txtIdx = 0;
txtData[txtIdx].key = "txt_ver";
txtData[txtIdx].value = "2";
txtIdx += 1;
txtData[txtIdx].key = "serial";
txtData[txtIdx].value = "123ABC789";
txtIdx += 1;
txtData[txtIdx].key = "feature";
txtData[txtIdx].value = 1;
txtIdx += 1;
txtData[txtIdx].key = "feature2";
txtData[txtIdx].value =1;
txtIdx += 1;
status = mdns_service_add(NULL, "_my_api", "_tcp", myTcpPort, NULL, 0);
if (ESP_OK != status) {
    gc_err("Failed to add mDNS service");
}
mdns_service_instance_name_set("_my_api", "_tcp", myHostname);
mdns_service_txt_set("_my_api", "_tcp", txtData, txtIdx);
Our local API implements a TCP server and also includes an SDDP server implementation.
I am going to look closer at the Local API code, but I am perplexed at why these crashes are happening and why heap poisoning is not pointing to any of my application code and only esp-idf components.
txtData[txtIdx].key = "feature";
txtData[txtIdx].value = 1;
Forgotten quote or a typo? That would explain the corruption... I'm afraid this won't compile clean, though.
I am perplexed at why these crashes are happening and why heap poisoning is not pointing to any of my application code and only esp-idf components.
The memory issues could easily be happening in IDF, too, but since you're not deinitializing the components often and are using mdns via HomeKit (a widely used scenario), I was suspecting the app code. Moreover, if you're seeing the crashes quite frequently (daily), I would expect the memory corruption to have been caught by testing, or other users to be experiencing similar problems.
Could you also please share some other portions of your code? Perhaps this would give us a hint, or we could try to reproduce it...
Forgotten quote or a typo? That would explain the corruption... I'm afraid this won't compile clean, though.
Sorry, I was attempting to make the code more generic and removed the quotes by accident. The quotes are indeed there in my code:
txtData[txtIdx].key = "feature";
txtData[txtIdx].value = "1";
I am going to do some further debugging of the local API. I will work on getting you some more code once I test a bit more.
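One possible way to avoid the index bookkeeping in the TXT record setup is a static initializer table. The sketch below uses a hypothetical `txt_item_t` stand-in with the same key/value string pair as `mdns_txt_item_t` so that it compiles without ESP-IDF headers; `api_txt`, `ARRAY_LEN`, and `txt_items_valid` are illustrative names, not part of the mdns component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the key/value string pair of mdns_txt_item_t,
 * so the sketch compiles on a host without ESP-IDF headers. */
typedef struct {
    const char *key;
    const char *value;
} txt_item_t;

/* Element count of a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Every value is a string literal; the count follows from the table itself. */
static const txt_item_t api_txt[] = {
    { "txt_ver",  "2" },
    { "serial",   "123ABC789" },
    { "feature",  "1" },
    { "feature2", "1" },
};

/* Returns 1 if every item has a non-empty key and a non-NULL value, else 0. */
static int txt_items_valid(const txt_item_t *items, size_t count)
{
    assert(items != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (items[i].key == NULL || items[i].key[0] == '\0' || items[i].value == NULL) {
            return 0;
        }
    }
    return 1;
}
```

In firmware, a non-const table of `mdns_txt_item_t` built the same way could presumably be passed to `mdns_service_txt_set` together with its element count, in place of `txtData` and `txtIdx`. The NULL check does not catch an integer stored where a string pointer is expected; that case is left to the compiler's diagnostics.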
@JonathanWitthoeft Any update about this issue? Were you able to recreate it with a reduced project that you can share?
I ran out of time debugging this issue and unfortunately cannot put more effort into it right now. What we are experiencing is a daily crash (with plenty of memory and no apparent heap corruption) that recovers without the customer's knowledge. Only a small group of our customers use this feature, and the crash/reboot has no noticeable impact.
Could you please close this issue then? You can always reopen it when you have more details or time for debugging.
Thanks for reporting. We will close this due to lack of feedback; feel free to reopen with more updates. Thanks.
Environment
IDF version (run git describe --tags to find it): v4.3.1
Compiler version (run xtensa-esp32-elf-gcc --version to find it): (crosstool-NG esp-2021r1) 8.4.0
Problem Description
We are seeing occasional crashes from some of our customers. When this happens it appears to be when an ACTION_RX_HANDLE mdns action calls pbuf_free(action->data.rx_handle.packet->pb); and for some reason the LWIP_ASSERT("pbuf_free: p->ref > 0", p->ref > 0); fails.
Debug Logs