Open ChrisEAlfred opened 2 years ago
Same here, it happens to me on 4.2.2.
It is less frequent than it used to be, but it still happens
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x40088b64 PS : 0x00060434 A0 : 0x801395d0 A1 : 0x3ffbf8f0
A2 : 0x00000001 A3 : 0x00000000 A4 : 0x60008048 A5 : 0x00000000
A6 : 0x00000004 A7 : 0x3ffbdcb8 A8 : 0x80088b64 A9 : 0x3ffbf8d0
A10 : 0x0000f2f2 A11 : 0x00000037 A12 : 0x00000014 A13 : 0xffffffff
A14 : 0x00000000 A15 : 0xfffffffc SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40088a9c LEND : 0x40088aa3 LCOUNT : 0x00000000
Core 0 was running in ISR context:
EPC1 : 0x401ccdf3 EPC2 : 0x00000000 EPC3 : 0x00000000 EPC4 : 0x40088b64
Backtrace:0x40088b61:0x3ffbf8f0 0x401395cd:0x3ffbf910 0x40019fb5:0x3ffbf930 0x40046683:0x3ffbf960 0x40047515:0x3ffbf980 0x4008bbb1:0x3ffbf9a0 0x4008b805:0x3ffbf9c0 0x4008df12:0x3ffbf9e0 0x4008f18f:0x3ffbfa00 0x4008645d:0x3ffbfa20 0x401399f5:0x3ffb5f10 0x40139fd5:0x3ffb5f30 0x4009487e:0x3ffb5f60
Core 1 register dump:
PC : 0x4009296d PS : 0x00060d34 A0 : 0x80092ad5 A1 : 0x3ffbcc00
A2 : 0x3ffc76a8 A3 : 0x00060d20 A4 : 0x800949c4 A5 : 0x3ffbcb20
A6 : 0x00000000 A7 : 0x3ffbc500 A8 : 0x00000015 A9 : 0x00000001
A10 : 0x000001a4 A11 : 0x00000000 A12 : 0x00060d20 A13 : 0x00060d23
A14 : 0x00000001 A15 : 0x3ffbc500 SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace:0x4009296a:0x3ffbcc00 0x40092ad2:0x3ffbcc20 0x40092cf0:0x3ffbcc40 0x4009487e:0x3ffbcc60
@Alvin1Zhang This issue has been reported quite a long time ago, and at least 3 people reported that the fix does not really fix the issue. https://github.com/espressif/esp-idf/issues/6517#issuecomment-891186177
This ongoing issue is the reason we removed the ESP32 from a product we're developing.
Thanks for reporting the issue. The crash happened because the controller ran out of memory: a malloc failure occurs and the assert is triggered. The issue is usually caused by slow processing in the application layer, which means the controller memory cannot be released in time. We did some optimization before; however, if the latest IDF version still has the issue, please refer to the following suggestions and give them a try.
Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.
Tiana, Espressif
@TianaESP We have a similar error, but 1) is not possible, as we need the tasks to do the work they need to do. 2) Update on that side: we already run at 240 MHz; more is not possible. 3) We can certainly try that, but we are not using Bluedroid; we are using the NimBLE stack, which seems to also use libbtdm, in which this assertion seems to happen.
May I ask what the particular reason is that there is no source for this part? That would make debugging easier. (I'm not much of a fan of reverse-engineering such stuff or reading Xtensa assembly.)
We can reproduce the issue with a run that takes around 3+ hours. I'm not sure a minimal repro can be made, as you guys would still require the hardware on which we do the tests. We can supply stack traces and registers if that may help, or if you wish we can certainly set up a remote session or something in order to get this working again. This is basically a show-stopper for our entire project, so if this can't be fixed we have to move to a different RTOS and redo a lot of the work.
I'm not too sure the controller memory is the problem, as we have 4096 allocated to it, so that is plenty. Or do you mean there is a leak inside the controller task? That would be very unfortunate.
> Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.
@TianaESP
Just to clarify about your comments above.
So you should be able to reproduce it by
In addition, this issue does not happen immediately. The device can run for hours and then suddenly hit the assert, so it's really difficult for application developers to know what could be wrong.
@AxelLin In our case we have dual-core running, we have the highest frequency set, and it still happens. Our host controller has 4096 as a buffer in memory.
@Platin21 Can you please provide the specific commit id of IDF and describe what the host task did?
@TianaESP We are using versions 4.3.1 and 4.4, but it was the same on earlier builds. We didn't want to go full master, as we are not sure about the stability of the other features that we use with it. Are there any recent changes that should have fixed this?
@TianaESP I can certainly provide you a trace, though I'm not sure if that will help a lot. We don't have JTAG on the board, so debugging is somewhat difficult.
(3332242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3332242) task_wdt: - IDLE (CPU 1)
(3332242) task_wdt: Tasks currently running:
(3332242) task_wdt: CPU 0: IDLE
(3332242) task_wdt: CPU 1: btController
(3332242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3332242) task_wdt: Print CPU 1 backtrace
Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3332399) UDS: Read data id fe20
(3332406) UDS: Read data id fe1f
(3332425) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3332426) UDP SERVER: JoDaHomas
(3332430) UDP SERVER: Waiting for data
(3337242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3337242) task_wdt: - IDLE (CPU 1)
(3337242) task_wdt: Tasks currently running:
(3337242) task_wdt: CPU 0: IDLE
(3337242) task_wdt: CPU 1: btController
(3337242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3337242) task_wdt: Print CPU 1 backtrace
Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3337321) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3337321) UDP SERVER: JoDaHomas
(3337325) UDP SERVER: Waiting for data
(3337647) UDS: Read data id fe20
(3337653) UDS: Read data id fe1f
(3342242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3342242) task_wdt: - IDLE (CPU 1)
(3342242) task_wdt: Tasks currently running:
(3342242) task_wdt: CPU 0: IDLE
(3342242) task_wdt: CPU 1: btController
(3342242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3342242) task_wdt: Print CPU 1 backtrace
Backtrace: 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3342356) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3342357) UDP SERVER: JoDaHomas
(3342360) UDP SERVER: Waiting for data
(3342928) UDS: Read data id fe20
(3342934) UDS: Read data id fe1f
(3347242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3347242) task_wdt: - IDLE (CPU 1)
(3347242) task_wdt: Tasks currently running:
(3347242) task_wdt: CPU 0: IDLE
(3347242) task_wdt: CPU 1: btController
(3347242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3347242) task_wdt: Print CPU 1 backtrace
Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3347382) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3347383) UDP SERVER: JoDaHomas
(3347388) UDP SERVER: Waiting for data
(3348169) UDS: Read data id fe20
(3348178) UDS: Read data id fe1f
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343 <==== The reported ASSERT
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x4018bf52 PS : 0x00060334 A0 : 0x800d4d40 A1 : 0x3ffbc7e0
A2 : 0x00000000 A3 : 0x00000001 A4 : 0x00000014 A5 : 0x000000a5
A6 : 0x000000a5 A7 : 0x00060023 A8 : 0x801132ae A9 : 0x3ffbc7c0
A10 : 0x00000000 A11 : 0x00000001 A12 : 0x00000014 A13 : 0x000000a5
A14 : 0x000000a5 A15 : 0x00060023 SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace:0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
Core 1 register dump:
PC : 0x40089689 PS : 0x00060f34 A0 : 0x80173c7c A1 : 0x3ffbeec0
A2 : 0x00000001 A3 : 0x00000000 A4 : 0x60008048 A5 : 0x00000000
A6 : 0x00000004 A7 : 0x3ffc21ec A8 : 0x80089684 A9 : 0x3ffbeea0
A10 : 0x00000000 A11 : 0x00000037 A12 : 0x00000014 A13 : 0xffffffff
A14 : 0x00000000 A15 : 0xfffffffc SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x400895bc LEND : 0x400895c3 LCOUNT : 0x00000000
(Edit removed color escape sequences)
And as I said earlier, we have no problem spending some time on this bug, as it is pretty important for us that this works.
@TianaESP And here are the decoded addresses; sadly still some `??` entries:
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40089686:0x3ffbeec00x40173c79:0x3ffbeee0 0x40019fb5:0x3ffbef00 0x400466c3:0x3ffbef30 0x4008c772:0x3ffbef50 0x4008c45d:0x3ffbef70 0x4008eb6a:0x3ffbef90 0x4008fde7:0x3ffbefb0 0x40083299:0x3ffbefd0 0x40174091:0x3ffe7c20 0x40174659:0x3ffe7c40
0x40089686: ??:?
0x40019fb5: ??:0
0x400466c3: ??:0
0x4008c772: ??:?
0x4008c45d: ??:?
0x4008eb6a: ??:?
0x4008fde7: intc.c:?
0x40083299: #idf path#/freertos/port/xtensa/xtensa_vectors.S:1105
0x40174091: ??:?
0x40174659: ??:?
In this case it was running for around 3 hours, which is not that uncommon for our use case. And as I said, we are willing to take some time to debug the issue, but for us it's not a clear case: we can only see that the failure happens inside libbtdm in the controller task.
After this the device hits a Guru Meditation Error and restarts, which kills the BLE connection and makes it worthless for the customer.
1. Could you give the commit id of the IDF you are using? 2. Could you describe the demo you are using? We have a demo in which the device does scan, adv and connect to four other devices, but we ran it for about one day and could not reproduce the issue.
@skyESP 1: rel4.4 https://github.com/espressif/esp-idf/tree/release/v4.4, rel4.3 https://github.com/espressif/esp-idf/tree/release/v4.3. This report was from a run with version 4.4 (should be the exact commit).
2: Just a normal BLE process with RPA enabled: we start with adv, then do a connect, and then we have GATT attributes that we retrieve every 5 seconds. Our process also does a scan for sensors, but that is not active during the tests.
We also have a UDP echo server running over Wi-Fi, which also gets polled in a 5-second interval; the BLE and UDP requests have an offset of 2.5 seconds so that they don't interfere.
We also have an additional task for retrieving CAN messages via one of the UARTs, where we also poll every 5 seconds.
(Edit) We have just a single device connected that does the polling (a Samsung A22 phone; tested with other devices, same outcome).
@Platin21 1. During the test, is the adv still continuing? If it is, what's the frequency?
@skyESP
And no, we have no option to switch to Bluedroid, as that would not fit in memory anymore.
@Platin21 OK, thanks for your information. We are trying to reproduce the issue.
For the third question: just the controller of BT. You can do it as follows. There is a menuconfig option to choose running on core 0 or core 1; run idf.py menuconfig. For Wi-Fi: (Top) → Component config → Wi-Fi → WiFi Task Core ID. For BT: (Top) → Component config → Bluetooth → Bluetooth controller (ESP32 Dual Mode Bluetooth) → The cpu core which bluetooth controller run.
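For reference, the menuconfig paths above correspond roughly to the following sdkconfig entries in IDF v4.x. The exact option names may differ between IDF versions, so verify them in your own menuconfig rather than copying blindly:

```
# Pin the Wi-Fi task to core 0 (name as in IDF v4.x; verify in menuconfig)
CONFIG_ESP32_WIFI_TASK_PINNED_TO_CORE_0=y
# Run the Bluetooth controller on core 0 (name as in IDF v4.x)
CONFIG_BTDM_CTRL_PINNED_TO_CORE_0=y
```

Changing either option requires a full rebuild, since the affected tasks are created at startup.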
Another two questions: 1. What's the priority of the task for retrieving CAN messages? Is it higher than the controller task? 2. From your description, you just use polling rather than an interrupt for handling the CAN messages, am I right?
@skyESP Let me check if that changes anything and I'll get back to you.
OK, so you set the priority of the controller task lower than the task that handles the CAN messages. Please do two tests here.
I suspect that the controller task is blocked by another task or by an interrupt.
@skyESP Where can one change the controller priority? It doesn't seem to be an option inside Component config -> Bluetooth -> Bluetooth Controller. Or what is the default level there?
@Platin21 1. You can change the priority in the code where you create the task. 2. From our experience, the controller should not be blocked by other tasks or interrupts for a long time. Though I do not know how long and how much work you do in your task, I advise you to run this test to verify my suspicion. If it works, this can be a fix.
@skyESP You mean the task that handles the BLE, so neither the controller nor the host?
I would first need to know below which value it needs to be so that it doesn't take priority.
@Platin21 Hi, 1. You had better keep the priority of the controller and host tasks as in our demo example. 2. The tasks you created yourself should be lower than the controller and host BLE tasks. The priority order should be: BLE controller task > BLE host task > application tasks (built by yourself). 3. If this still does not work, can you give us your code?
@skyESP So back to the priorities: the host task seems to run at 21 and the LL task seems to run at 24, so our CAN task is actually already lower in priority. As for the other tasks on the system, all of them run with priority 0, so it should not be possible for them to interfere at all.
Hi @Platin21,
Sorry for the late reply.
Yes, priority 0 for the other tasks should not interfere with them.
For the ble_ibeacon demo example in our IDF SDK:
the controller task is 23, the HCI host task is 22, the BTU task is 20, and the BTC task is 19.
Yeah, that makes it clearer now. We will set something up tomorrow and I will leave my response here. It sounds very promising; we are now switching our priorities to be under a value of 18. It would be a huge step forward for us if that works out.
After switching the priority and increasing the buffer, the bug still happens. @skyESP Can we arrange a meeting in some way to maybe get a clearer outcome? We don't really have the option to reduce the load on the system, as I already said.
I can also share the crash log again, but it is basically the same thing as before.
@Platin21 Hi, if you stop the task that handles the CAN messages, does the issue still happen?
@skyESP If I stop this task I can't use CAN, so even if the issue doesn't happen then, that's not a solution.
OK. Then what do you do in the interrupt function of the CAN? Does this part cost a lot of time? Could you share the related code?
It's basically a loop that does RX and TX of CAN messages and stores them in a static array. It looks something like this:
static void
can_isr( void *arg )
{
    uint8_t events;

    /* Read the pending interrupt flags from the controller. */
    events = can_base->interrupt;

    if( /* check for b0 (rx) event */ ) {
        can_rx_isr();
    }
    if( /* check for b1 (tx) event */ ) {
        can_tx_isr();
    }
    if( /* check if an overrun or bus error occurred */ ) {
        can_error_flag = 1u;
        can_error_code_capture = can_base->err_code;
    }
}
Inside can_rx_isr() there is just a function that stores the frame we got and validates it. On the TX side we write to the transmit FIFO, and that seems to be it (just a byte copy).
I have to say, though, that without the load of the UDS the controller seems to work just fine.
What we are trying now is to make a task that simulates a load and see if that makes any difference.
Hi @Platin21, I think you can post a message from can_isr to another task to handle the work of can_rx_isr() and can_tx_isr(). You should not do lots of work in the interrupt.
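The usual shape of that suggestion is: the ISR only records the raw event and a task does the heavy lifting later. On ESP-IDF you would normally use a FreeRTOS queue with xQueueSendFromISR(); the sketch below shows the same deferral pattern with a minimal single-producer/single-consumer ring buffer so it can run on a host machine. The names (ring_t, ring_push, ring_pop) are hypothetical, not from the thread's code:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 16u   /* power of two so unsigned index math wraps cheaply */

typedef struct {
    uint8_t  data[RING_SIZE];
    volatile uint32_t head;  /* written only by the ISR (producer) */
    volatile uint32_t tail;  /* written only by the task (consumer) */
} ring_t;

/* Called from interrupt context: O(1), no allocation, no blocking. */
static int ring_push(ring_t *r, uint8_t event)
{
    if (r->head - r->tail >= RING_SIZE)
        return 0;                         /* full: drop, count the overflow elsewhere */
    r->data[r->head % RING_SIZE] = event;
    r->head++;
    return 1;
}

/* Called from task context: returns 1 if an event was dequeued into *event. */
static int ring_pop(ring_t *r, uint8_t *event)
{
    if (r->tail == r->head)
        return 0;                         /* empty */
    *event = r->data[r->tail % RING_SIZE];
    r->tail++;
    return 1;
}
```

The key property is that the ISR body stays a few instructions long; frame validation, NVS writes, and anything else time-consuming moves into the consumer task, which can run at a priority below the BLE controller.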
@skyESP Any estimates of how long would be too long to do something in the interrupt without hitting the controller too much?
The working cases of the controller are complicated (scan state, adv state, connecting state, while receiving and transmitting data at different frequencies), so it is hard to have such data.
@skyESP Just one thing on that side: the interrupt in our case isn't happening on the same core. Is a problem with the interrupt then still possible?
Do you mean you can handle the interrupt on another core? I think you can give it a try. The BLE controller and host run on core 0 by default; you can make core 1 handle the interrupt, but I am not sure this will help.
We have found that this error is provoked if you regularly use SPIFFS, maybe during erase cycles. Unfortunately, without the source code this is almost impossible to catch.
In an earlier post you stated: "The crash was because after the controller ran out of memory, malloc failure happens and assert is triggered. The issue is usually caused by the slow processing of the application layer, which caused the controller memory could not be released in time"
What is the amount of memory required? What is the longest "processing of the application layer" that is allowed?
@ChrisEAlfred We are not using SPIFFS; we are using FAT and NVS. In our case basically all the memory that the normal ESP32 has is in use; I think about 1 KB is still left.
Yeah, that processing limit was also interesting to me, but it seems like that's not easy to say.
SPIFFS and NV storage use the same memory.
So I guess the question is ... does using the ESP32 NV in any way provoke this problem? Our tests appear to indicate that this is the case.
@ChrisEAlfred The thing is, we didn't have this controller error when I just ran it without the interrupt from the CAN driver; after 8 hours, still nothing. The average time for it to happen was 4 to 6 hours. If it's the NVS, then I'm not sure why it didn't happen in this case.
@skyESP It seems it's related to our CAN side. Could it be that somehow something gets overlapped in the scheduler, causing this problem?
@ChrisEAlfred You could be right; we are doing some stuff on our CAN side with the NVS for storing the state we were in. Not 100% sure, though, that this causes the problems. How did you test for that bug on your side? Also, is it the writing or the reading that causes these problems on your side?
@ChrisEAlfred We disabled this setting and still get the error, so from what I can see there isn't a difference.
As for the CAN side, we can't move it to use polling, as the timings would then no longer comply. @skyESP Any other ideas?
The following link helped me cure the problem: https://stackoverflow.com/questions/51750377/how-to-disable-interrupt-watchdog-in-esp32-or-increase-isr-time-limit
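The knobs behind that workaround are sdkconfig options for the interrupt and task watchdog timeouts. Roughly, in IDF v4.x they look like the fragment below (names from memory; verify in menuconfig under Component config → ESP System Settings). Note that raising the timeouts only hides the underlying stall; the controller assert can still fire:

```
# Interrupt watchdog timeout (default is 300 ms in IDF v4.x)
CONFIG_ESP_INT_WDT_TIMEOUT_MS=800
# Task watchdog timeout (default is 5 s)
CONFIG_ESP_TASK_WDT_TIMEOUT_S=10
```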
Opening - same as https://github.com/espressif/esp-idf/issues/6517. Still occurs in 4.3.1