espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.45k stars 7.25k forks source link

ASSERT_PARAM(-218959118 0), in arch_main.c at line 343 (IDFGH-6126) #7805

Open ChrisEAlfred opened 2 years ago

ChrisEAlfred commented 2 years ago

Opening - same as https://github.com/espressif/esp-idf/issues/6517. Still occurs in 4.3.1

alexmantaut commented 2 years ago

Same here, it happens to me on 4.2.2.

It is less frequent than it used to be, but it still happens

ASSERT_PARAM(-218959118 0), in arch_main.c at line 343
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x40088b64  PS      : 0x00060434  A0      : 0x801395d0  A1      : 0x3ffbf8f0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x60008048  A5      : 0x00000000  
A6      : 0x00000004  A7      : 0x3ffbdcb8  A8      : 0x80088b64  A9      : 0x3ffbf8d0  
A10     : 0x0000f2f2  A11     : 0x00000037  A12     : 0x00000014  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x40088a9c  LEND    : 0x40088aa3  LCOUNT  : 0x00000000  
Core  0 was running in ISR context:
EPC1    : 0x401ccdf3  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40088b64

Backtrace:0x40088b61:0x3ffbf8f0 0x401395cd:0x3ffbf910 0x40019fb5:0x3ffbf930 0x40046683:0x3ffbf960 0x40047515:0x3ffbf980 0x4008bbb1:0x3ffbf9a0 0x4008b805:0x3ffbf9c0 0x4008df12:0x3ffbf9e0 0x4008f18f:0x3ffbfa00 0x4008645d:0x3ffbfa20 0x401399f5:0x3ffb5f10 0x40139fd5:0x3ffb5f30 0x4009487e:0x3ffb5f60
Core  1 register dump:
PC      : 0x4009296d  PS      : 0x00060d34  A0      : 0x80092ad5  A1      : 0x3ffbcc00  
A2      : 0x3ffc76a8  A3      : 0x00060d20  A4      : 0x800949c4  A5      : 0x3ffbcb20  
A6      : 0x00000000  A7      : 0x3ffbc500  A8      : 0x00000015  A9      : 0x00000001  
A10     : 0x000001a4  A11     : 0x00000000  A12     : 0x00060d20  A13     : 0x00060d23  
A14     : 0x00000001  A15     : 0x3ffbc500  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace:0x4009296a:0x3ffbcc00 0x40092ad2:0x3ffbcc20 0x40092cf0:0x3ffbcc40 0x4009487e:0x3ffbcc60
AxelLin commented 2 years ago

@Alvin1Zhang This issue has been reported quite long time ago, and at least 3 people reported that the fix does not really fix the issue. https://github.com/espressif/esp-idf/issues/6517#issuecomment-891186177

jmaha commented 2 years ago

This ongoing issue is the reason we removed the ESP32 from a product we're developing.

TianaESP commented 2 years ago

Thanks for reporting the issue. The crash was because after the controller ran out of memory, malloc failure happens and assert is triggered. The issue is usually caused by the slow processing of the application layer, which caused the controller memory could not be released in time. We did some optimization before, however, if the latest IDF version still has the issue, please refer to the following suggestions and give it a try.

  1. Reduce time-consuming operations at the application layer, such as a large number of log printing.
  2. Increase the CPU speed, for example, increase the CPU frequency to 240Mhz. If dual-core is supported, run bluedroid host and Controller on different cores.
  3. If you are scanning at the same time: Try to reduce controller memory consumption, for example, by making the following changes in the configuration: (1) Change BLE adv report flow control number to 20 and BLE adv lost event threshold value to 100 (2) Disable BLE full scan feature supported (3) Enable BLE queue congestion check (4) Disable Report adv data and scan response individually when BLE active scan (5) Enable duplicate Scan when setting scan parameters to filter out duplicate broadcast packets

Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.

Tiana, Espressif

Platin21 commented 2 years ago

@TianaESP we have a similar error but 1) is not possible as we need the tasks todo the stuff they need todo. 2) Update on that side we already run at 240hz more is not possible 3) We can certainly try that but we are not using bluedroid we are using the nimble stack which seems to also use libbtdm in which this assertion seems to happen.

May i ask what is the particular reason why there is not source for this part? As that would make debugging easier. (Not much of a fan of re'ing such stuff or reading xtensa assembly.. )

We can reporduce the issue with a run that takes around +3 hours. Not sure if there can be a minimal repro made as you guy's would still require the hardware on which we do the tests. We can supply stacktraces and registers if that may help. Or if you wish we can certainly setup a remote session or something in order to get this working again. This is basically a show stopper for our entire project so if this can't be fixed we have to move to a diffrent RTOS and do a lot of the work again.

Platin21 commented 2 years ago

I'm not to sure if the controller memory is the problem as we have 4096 allocated to it so that is plenty or do you mean there is a leak inside the controller task? That would be very unfortnunate.

AxelLin commented 2 years ago

Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.

@TianaESP

Just to clarify about your comments above.

So you should be able to reproduce it by

In additional, this issue does not happen immediately. The device can run for hours then it suddenly hit assert. So it's really difficult for application developers to know what could be wrong.

Platin21 commented 2 years ago

@AxelLin in our case we have dual-core running, we have the highest frequnecy set, we have it still happening. Our host controller has 4096 as a buffer in memory.

TianaESP commented 2 years ago

@Platin21 Can you please provide the specific commit id of IDF and describe what the host task did?

Platin21 commented 2 years ago

@TianaESP We are using version 4.3.1 and 4.4 but it was the same on ealier builds. We didn't want to go full master as we are not sure about the stability of the other features that we use with that. Are there any recent changes that should have fixed this?

Platin21 commented 2 years ago

@TianaESP I can certainly provide you a trace not sure if that will help alot though. We don't have JTAG on the board so debugging is somewhat difficult.

(3332242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3332242) task_wdt:  - IDLE (CPU 1)
(3332242) task_wdt: Tasks currently running:
(3332242) task_wdt: CPU 0: IDLE
(3332242) task_wdt: CPU 1: btController
(3332242) task_wdt: Print CPU 0 (current core) backtrace

Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

(3332242) task_wdt: Print CPU 1 backtrace

Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3332399) UDS: Read data id fe20
(3332406) UDS: Read data id fe1f
(3332425) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3332426) UDP SERVER: JoDaHomas
(3332430) UDP SERVER: Waiting for data
(3337242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3337242) task_wdt:  - IDLE (CPU 1)
(3337242) task_wdt: Tasks currently running:
(3337242) task_wdt: CPU 0: IDLE
(3337242) task_wdt: CPU 1: btController
(3337242) task_wdt: Print CPU 0 (current core) backtrace

Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

 (3337242) task_wdt: Print CPU 1 backtrace

Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3337321) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3337321) UDP SERVER: JoDaHomas
(3337325) UDP SERVER: Waiting for data
(3337647) UDS: Read data id fe20
(3337653) UDS: Read data id fe1f
(3342242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3342242) task_wdt:  - IDLE (CPU 1)
(3342242) task_wdt: Tasks currently running:
(3342242) task_wdt: CPU 0: IDLE
(3342242) task_wdt: CPU 1: btController
(3342242) task_wdt: Print CPU 0 (current core) backtrace

Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

 (3342242) task_wdt: Print CPU 1 backtrace

Backtrace: 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3342356) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3342357) UDP SERVER: JoDaHomas
(3342360) UDP SERVER: Waiting for data
(3342928) UDS: Read data id fe20
(3342934) UDS: Read data id fe1f
(3347242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3347242) task_wdt:  - IDLE (CPU 1)
(3347242) task_wdt: Tasks currently running:
(3347242) task_wdt: CPU 0: IDLE
(3347242) task_wdt: CPU 1: btController
(3347242) task_wdt: Print CPU 0 (current core) backtrace

Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

(3347242) task_wdt: Print CPU 1 backtrace

Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3347382) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3347383) UDP SERVER: JoDaHomas
(3347388) UDP SERVER: Waiting for data
(3348169) UDS: Read data id fe20
(3348178) UDS: Read data id fe1f
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343                                               <==== The reported ASSERT
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x4018bf52  PS      : 0x00060334  A0      : 0x800d4d40  A1      : 0x3ffbc7e0  
A2      : 0x00000000  A3      : 0x00000001  A4      : 0x00000014  A5      : 0x000000a5  
A6      : 0x000000a5  A7      : 0x00060023  A8      : 0x801132ae  A9      : 0x3ffbc7c0  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x00000014  A13     : 0x000000a5  
A14     : 0x000000a5  A15     : 0x00060023  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace:0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820 

Core  1 register dump:
PC      : 0x40089689  PS      : 0x00060f34  A0      : 0x80173c7c  A1      : 0x3ffbeec0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x60008048  A5      : 0x00000000  
A6      : 0x00000004  A7      : 0x3ffc21ec  A8      : 0x80089684  A9      : 0x3ffbeea0  
A10     : 0x00000000  A11     : 0x00000037  A12     : 0x00000014  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400895bc  LEND    : 0x400895c3  LCOUNT  : 0x00000000  

(Edit removed color escape sequences)

And as i said ealier we have no problem spending some time on this bug as this is pretty important for us to work.

Platin21 commented 2 years ago

@TianaESP And here is the decoded addresses sadly still some ??'s:

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40089686:0x3ffbeec00x40173c79:0x3ffbeee0 0x40019fb5:0x3ffbef00 0x400466c3:0x3ffbef30 0x4008c772:0x3ffbef50 0x4008c45d:0x3ffbef70 0x4008eb6a:0x3ffbef90 0x4008fde7:0x3ffbefb0 0x40083299:0x3ffbefd0 0x40174091:0x3ffe7c20 0x40174659:0x3ffe7c40
0x40089686: ??:?
0x40019fb5: ??:0
0x400466c3: ??:0
0x4008c772: ??:?
0x4008c45d: ??:?
0x4008eb6a: ??:?
0x4008fde7: intc.c:?
0x40083299: #idf path#/freertos/port/xtensa/xtensa_vectors.S:1105
0x40174091: ??:?
0x40174659: ??:?
Platin21 commented 2 years ago

In this case it was running for around 3 hours which is for our use case not that uncommon. And as i said we are willing to take some time to debug the issue but for us it's not a clear case. And we can only see that libbtdm as inside the controller task we have a failure.

Platin21 commented 2 years ago

After this the device has a Guru Mediation Error and Restarts which kills the BLE Connection and makes it worthless for the customer.

skyESP commented 2 years ago

1.Coud you give the commit id of the idf you are using? 2 Could you describe the demo you are using? We have a demo which the device do scan , adv and connect to four other devices. But we run about one day and could not reproduce the issue.

Platin21 commented 2 years ago

@skyESP 1: rel4.4 https://github.com/espressif/esp-idf/tree/release/v4.4 rel4.3 https://github.com/espressif/esp-idf/tree/release/v4.3 This report was from a run with version 4.4 (should be exact commit)

2: Just a normal BLE process with RPA Enabled. As in start with adv then do a connect then we have GATT Attributes that we retrieve every 5 seconds. Our process also does do a scan for sensors but that is not active during the tests.

We also have a UDP Echo server running which also gets polled in a 5 second interval which is running over Wifi the BLE and UDP Requests have a offset of 2.5 Seconds so that they don’t interfere.

We also have a Additional Task for retrieving CAN Messages via one of the Uart‘s where we also poll each 5 seconds

(Edit) We have just a single device connected that does the polling (Samsung A22 Phone, Tested with other devices same outcomes)

skyESP commented 2 years ago

@Platin21 1.During the test, Does the Adv is still continue? If it does,what's the frequence?

  1. Do you use bluedroid or nimble as your host layer? 3.Could you try to run the wifi and bt on different core. For example wifi 0 core ,bt 1 core.
Platin21 commented 2 years ago

@skyESP

  1. No it doesn’t continue
  2. As i already said we use nimble
  3. What part of bt? The host our task or the controller? We switched some stuff already around and changed priorities but that made no notable change..
Platin21 commented 2 years ago

And no we have no option to switch to bluedroid as that would not fit in memory anymore.

skyESP commented 2 years ago

@Platin21 OK,Thanks for your information. We are trying to reproduce the issue.

For the third question. Just the controller of bt . You can do as following. We have menuconfig to chose running on core0 or core1. run idf.py menuconfig For wifi: (Top) → Component config → Wi-Fi → WiFi Task Core ID For bt (Top) → Component config → Bluetooth → Bluetooth controller(ESP32 Dual Mode Bluetooth) → The cpu core which bluetooth controller run

Another two questions 1.What's the priority of the task for retrieving CAN Messages ? Does it higher than controller task? 2.From your describe you just use polling way not using interrupt for handling the CAN Messages , Am I right?

Platin21 commented 2 years ago

@skyESP Let me check if that changes anything and i get back to you.

  1. 20 from max 25. The BLE one is also slightly evalated i think it’s 15 there.
  2. It’s done via Interrupt, Polling is only used for retrieving the uds messages
skyESP commented 2 years ago

OK, You set the priority of the controller task lower than the task to handle the can message. Here please do two tests.

  1. Set the controller task higher than the task to handle the CAN message.If still have problem please do the step 2. 2.Reduce the load or stop the the task to handle the CAN message and check if the issue is still here.

I suspect that the controller task is blocked by other task or the interrupt.

Platin21 commented 2 years ago

@skyESP where can one change the controller priority? Seems to be not a option inside Component config -> Bluetooth -> Bluetooth Controller. Or what is the default level there?

  1. Isn’t really a fix but for testing sure i can do that.
skyESP commented 2 years ago

@Platin21 1.You can change the priority in the code where you create the task. 2.From our experience the controller should not be blocked by other tasks or interrupt for long time. Through I do not know how long and how much work you did in your task,so I advance you to do this to test and verify my suspection. If it works,this can be a fix.

Platin21 commented 2 years ago

@skyESP you mean the task that handles the BLE so neither the controller nor the host?

I would first need to know under which value it needs to be to not take priority.

skyESP commented 2 years ago

@Platin21 hi, 1.You had better keep the priority of controller and host task as our demo example. 2.The task you created yourself should lower than the controller and host ble task. The priority should be: ble controller task > ble host task > application task ( build by your self) 3.If this still can not work, can you give us your code?

Platin21 commented 2 years ago
  1. I didn’t change anything there what i was not sure about is if our task should have a higher prio than the can one
  2. This was the case anyways i think meanwhile i dunno what the default prio of these other two tasks is.
  3. Possible but that needs to be elaborated in a Email. And i suspect it makes more sense that we setup something as i already said and you maybe can guess that the pcb on which the esp sits is custom. @skyESP
Platin21 commented 2 years ago

@skyESP so back to the priorities the host task seems to run on 21 and the LL task seems to run on 24 so our can task is acutally already lower in priority. And to the other tasks on the system all of them run with priority 0 so should not be possbile for it to interfere at all then.

skyESP commented 2 years ago

hi,@Platin21 Sorry for replay late. Yes,Priority 0 of other tasks should not interfere them.
For the example ble_ibeacon demo in our IDF sdk. The controller task is 23,hci host task is 22, BTU task is 20 ,BTC task is 19.

Platin21 commented 2 years ago

Yeah that makes it clearer now. We will setup something tomorrow and i will leave my response here. But yeah it sounds very promising. We are now switching the priorities to be under a value of 18. Would be a huge step for us forward if that works out.

Platin21 commented 2 years ago

After switching the "priority and increasing" the buffer the bug still happens. @skyESP Can we arrange a meeting in some ways to get maybe a clearer outcome? We don't really have the option to reduce the load on the system as i already said.

Platin21 commented 2 years ago

I can share also the crash log again but it is basically the samething as it was.

skyESP commented 2 years ago

@Platin21 hi,If you stop the task to handle the CAN message. Did this issue still happen?

Platin21 commented 2 years ago

@skyESP if i stop this task i can’t use CAN so even if it doesn’t happen then that’s not a solution.

skyESP commented 2 years ago

Ok,Then what did you do in the interrupt function of the CAN? Does this part cost lot of time? Could you share the related code?

Platin21 commented 2 years ago

Its basically a loop that does rx and tx of can messages and stores them in a static array. Looks something like that:

static void
can_isr( void *arg )
{
  uint8_t events;
  events = can_base->interrupt;

  if( /* check for b0 event */ ) {
    can_rx_isr();
  }
  if(/* check for b1 event */) {
    can_tx_isr();
  }
  if( /* Check if overrun or bus error occured */) {
    can_error_flag = 1u;
    can_error_code_capture = can_base->err_code;
  }
}

In side can_rx_isr() is just a function that stores the frame we got and validates it. In the tx side we write to the transmit fifo and that seems to be it. (Just a byte copy)

I have to say though without the load of the uds the controller seems to just work fine.

Platin21 commented 2 years ago

What we are now trying is to make a task that simulates a load and see if that makes any diffrence.

skyESP commented 2 years ago

hi,@Platin21 I think you can post a message from the can_isr to other tasks to handle the "can_rx_isr()" and "can_tx_isr()". It should not do lots of work in the interrupt.

Platin21 commented 2 years ago

@skyESP any estimates how long would be to long to todo something in the interrupt? Without hitting the controller to much?

skyESP commented 2 years ago

The working cases of the controller is complicated(Scan state,adv state,connecting state while receive and transmit data in different frequency), so it is hard to have such data.

Platin21 commented 2 years ago

@skyESP just a thing on that side the int in our case isn't happening on the same core is a problem with the int then still possible?

skyESP commented 2 years ago

Do you mean you can handle the int in another core? I think you can have a try. The ble controller and host is default running on the core 0. You can make the core 1 handle the int. But I am not sure this way can help.

ChrisEAlfred commented 2 years ago

We have found that this error is provoked if you regularly use SPIFFS. Maybe during erase cycles. Unfortunately, without the source code this is almost impossible to catch.

In an earlier post you stated: "The crash was because after the controller ran out of memory, malloc failure happens and assert is triggered. The issue is usually caused by the slow processing of the application layer, which caused the controller memory could not be released in time"

What is the amount of memory required? What is the longest "processing of the application layer" that is allowed?

Platin21 commented 2 years ago

@ChrisEAlfred we are not useing spiffs we are using fat and nvs. In our case basically all memory that the normal esp32 has i think 1k bytes is still left.

Yeah that processing was also interesting for me but it seems like that’s nit easy to say.

ChrisEAlfred commented 2 years ago

SPIFFS and NV storage use the same memory.

So I guess the question is ... does using the ESP32 NV in any way provoke this problem? Our tests appear to indicate that this is the case.

Platin21 commented 2 years ago

@ChrisEAlfred thing is we didn’t have this controller error when i just run it without having the int from the can driver and after 8 hours still nothing. Teh average time for that to happen was 6 to 4 hours. If it’s the NVS then i’m not sure why this didn’t happen in this case.

Platin21 commented 2 years ago

@skyESP it seems it's related to our CAN Side. Could it be that somehow something gets overlaped in the sheduler that makes this problem happening?

@ChrisEAlfred you could be right we are doing some stuff on our can side with the nvs for storing the state in which we where. Not 100% though if that makes the problems.. How did you test for that bug on your side? Also is it the writting or the reading that makes these problems on your side?

Platin21 commented 2 years ago

@ChrisEAlfred we disabled this setting and still get the error so from what i can see there isn't a diffrence.

Platin21 commented 2 years ago

So for the CAN Side we can't move that to use Polling as that makes the timings not to comply. @skyESP any other ideas?

bshmid commented 2 years ago

The following link helped me cure the problem: https://stackoverflow.com/questions/51750377/how-to-disable-interrupt-watchdog-in-esp32-or-increase-isr-time-limit