BT(?) causes random Guru Mediation Error

EDsteve commented 9 months ago

@LIFsCode @OOHehir Not sure if these reboots are connected but i think it is because of BT turning OFF or ON.

Situation 1

After a few hours of recordon_detecton i triggered the knock sensor and then it rebooted with Ruru Mediation Error. This happened only once for now. Log files: Using commit cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0 Reboot while recordon_detecton.txt Bootlog after crash.txt

Situation 2

Start the ELOC and just wait. When BT turns off due to the BT time out it restarts. This doesn't happeb all the time but a few times already since a few weeks.

Using commit: 29325f68d7c831e7b754bf81b368c9930524c5eb Reset when BT turns off.txt

Using commit: 8cc4b9949c1cb9bd53aed99f9485e725fab7dcb0 Reset when BT turns off 2.txt

Using commit: cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0 Reset when BT turns off 3.txt

OOHehir commented 9 months ago

@EDsteve @LIFsCode This could be perhaps a memory exhaustion issue, I made some changes as discussed, it may solve this issue also.

EDsteve commented 9 months ago

@LIFsCode @OOHehir

I am adding more reboot logs here, which hopefully give more insight to find the problem.

1. Reboots with assert failed (Happend only once in my tests)

How it happened: ELOC boots, i connect with app, start recordon_detectoff, device reboots after a few seconds
Commit d2dd26c8f8e37bfb0c3e9861ef266469df3d2101
Log: Reboot_assert failed_during_recording.txt

2. Another Guru Reboot (Happens very seldom)

Same as "Situation 2" in my first post. ELOC reboots when BT turns off. So that issue seems to be not solved yet.
Commit: d2dd26c8f8e37bfb0c3e9861ef266469df3d2101
Log: Guro_BT-timeout_d2dd26c8f8e37bfb0c3e9861ef266469df3d2101.txt

3. One more Guru

How it happened: After 12h of detecting with three features. I tried to connect with BT, then i get the Gurur Reboot
Older commit: 0fbf78037fb15d81b67e9988404505de88315c81
Log: Guro_Reboot_when_connecting_with_BT_after_12h.txt
Older commit. Not sure if that issue is still happening with the latest commits

OOHehir commented 9 months ago

@EDsteve Thanks ED. If you can use the method to trace where in the code the crashes occur it would really help. You might remember we did this before, here's a link to the documentation.

EDsteve commented 9 months ago

@OOHehir Okay. I see what is going on now. The VScode Serial monitor doesn't do backtrace to the files (which i have used lately). But the PlatformIO Monitor does.

Assert failed:

How it happened: ELOC boots, i connect with app, start recordon_detectoff, devicd reboots after BT time out
Commit: f4a6a8e6c3d3ceb6e098aaf379cfd48cfb49d682
Log (With backtrace): Assert failed after BT timeout.txt

LIFsCode commented 9 months ago

@EDsteve @OOHehir this looks like some heap or heap fragmentation problem or a stack overflow which then overwrites the subsequent heap sections.

I try to add some additional debugging info and try to reproduce it at home

EDsteve commented 9 months ago

@LIFsCode Forgot to mention. That this does only happen occasionaly. Haven't really seen a pattern, but i suspect it mostly happens with the first BT timout after a reboot. Not sure if that can make sense.

EDsteve commented 9 months ago

@OOHehir @LIFsCode

I catched another Guro error with Backtrace. This time in a loop:

Guru Reboot Loop

How it happens: ELOC boots up, BT times out after 60 sec -> Reboot Once i manually restart the ELOC with the On Off switch. The reboot loop dissapears and all is back to normal
I had the same reboot loop once a few weeks ago. And today again. Using this commit d69080d9ce6eb05b3a9967dc8f74919bccfcbef0
Log (With backtrace): Guru Reboot Loop.txt

EDIT: The buzzer wakes up the acelerometer every time it buzzes. So when 60 sec time out happens, the buzzer wakes up the knock sensor at the same time whcih also wakes up BT again. Only on this specific PCB the knock sensor wakes up from the buzzer. But maybe that is the reason for the reboot loop? Will play around with the accelerometer sensitivity to find a better value.

OOHehir commented 9 months ago

@EDsteve @LIFsCode I just had a quick look at this log & perhaps its a memory issue also? There seems to be lots occuring simultaneously; file rotation, ArduinoJSON memory pool, file writing & a semaphore queue being established. That's just an educated guess from first reading though..

EDsteve commented 9 months ago

@LIFsCode @OOHehir

Another Guru reboot. Do more log files help you to figure out where the problem comes from? Or should i stop uploading them?

Guru Reboot after interactin with BT

How it happens: ELOC was in recOn_detectON mode for >24 hours, and after one of my rabdom BT-test-connections it rebooted
Commit: d69080d9ce6eb05b3a9967dc8f74919bccfcbef0
Config: stamptest1706795112761.txt
Log (With backtrace): Guru.txt

EDsteve commented 9 months ago

From another device i am getting BT errors. But no reboot and still working normal. Not sure if that is connected to the reboots. But i will post it anyway in case it helps: E (133702274) BT_OSI: config_save, err_code: 0x4

BT Error in log

How it happened: ELOC was in recOn_detectON mode for >24 hours, and after one of my random BT-test-connections this error appeares now every time BT turns on or off. But the ELOC app can connect without problems.
Commit: d69080d9ce6eb05b3a9967dc8f74919bccfcbef0
Here the log BT-Error.txt

OOHehir commented 9 months ago

@LIFsCode @OOHehir

Another Guru reboot.

Do more log files help you to figure out where the problem comes from? Or should i stop uploading them?

Guru Reboot after interactin with BT

How it happens: ELOC was in recOn_detectON mode for >24 hours, and after one of my rabdom BT-test-connections it rebooted

Commit: d69080d9ce6eb05b3a9967dc8f74919bccfcbef0

Config: stamptest1706795112761.txt

Log (With backtrace): Guru.txt

This seems a bit different:

Core 0 panic'ed (Interrupt wdt timeout on CPU0)

OOHehir commented 9 months ago

From another device i am getting BT errors. But no reboot and still working normal. Not sure if that is connected to the reboots. But i will post it anyway in case it helps: E (133702274) BT_OSI: config_save, err_code: 0x4

BT Error in log
* How it happened: ELOC was in recOn_detectON mode for >24 hours, and after one of my random BT-test-connections this error appeares now every time BT turns on or off. But the ELOC app can connect without problems.

* Commit: [d69080d](https://github.com/LIFsCode/ELOC-3.0/commit/d69080d9ce6eb05b3a9967dc8f74919bccfcbef0)

* Here the log [BT-Error.txt](https://github.com/LIFsCode/ELOC-3.0/files/14143010/BT-Error.txt)

I don't see this error code in the documentation but perhaps a wrapper for this?

LIFsCode commented 9 months ago

@LIFsCode @OOHehir

Another Guru reboot. Do more log files help you to figure out where the problem comes from? Or should i stop uploading them?

Guru Reboot after interactin with BT

* How it happens: ELOC was in recOn_detectON mode for >24 hours, and after one of my rabdom BT-test-connections it rebooted

* Commit: [d69080d](https://github.com/LIFsCode/ELOC-3.0/commit/d69080d9ce6eb05b3a9967dc8f74919bccfcbef0)

* Config: [stamptest1706795112761.txt](https://github.com/LIFsCode/ELOC-3.0/files/14142906/stamptest1706795112761.txt)

* Log (With backtrace): [Guru.txt](https://github.com/LIFsCode/ELOC-3.0/files/14142820/Guru.after.SD.card.check.txt)

Ok this is a different topic: The Core0 is stucked in an ISR (Interrupt Service Routine).

Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x4000bfe7  PS      : 0x00060835  A0      : 0x80085305  A1      : 0x3ffbe5d0
A2      : 0x00000001  A3      : 0x00060823  A4      : 0xfffffff0  A5      : 0x00060023
A6      : 0x00000000  A7      : 0x3ffdf580  A8      : 0x00000008  A9      : 0x3ffbe4e0  
A10     : 0x00000004  A11     : 0x00060024  A12     : 0x00060023  A13     : 0x00060023
A14     : 0x00060023  A15     : 0x00000001  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400928d8  LEND    : 0x400928f4  LCOUNT  : 0xffffffff
Core  0 was running in ISR context:
EPC1    : 0x401779a3  EPC2    : 0x400869c6  EPC3    : 0x400869b1  EPC4    : 0x400869b1

The only thing I could image causing this, is the knock sensor, which has an interrupt pin to wake the ELOC. @EDsteve are you knocking on the ELOC? I think maybe the the knock sensor is too sensitive. The interrupt is edge triggered so I assume the pin is toggling like crazy, but I haven't seen it here, so I can't tell for sure.

@EDsteve: if that happens again, could you please measure the "Accel_Trigger" signal on the PCB

LIFsCode commented 9 months ago

Ok, I found something else, which might give another explanation to the WDT timeout: https://esp32.com/viewtopic.php?t=33221

Unfortunatelly I could reproduce the reboot only a single time: WDStuck.txt

Test: Check BT core pinning

This might be related to the above discribed issue, though I'm not sure. I will move the BT core to Core 1 and and have to ask @EDsteve to rerun the tests. Sorry it's frustrating, but I don't have any better idea.

Test: Check effects of sd card logging

@EDsteve : Do you also have these issues when the SD card logging is disabled? Since this adds a lot of file system handling even to integrated functions (as bluetooth, etc.) this might cause troubles.

I splittet optimizing the sd card logs in another issue this will be a bit more work, not sure if I manage to do it before the next deadline. If disabling the sd card log solved these issues, than that would be at least a workaround

EDsteve commented 9 months ago

@LIFsCode No worries. I don't mind firmware tests :) Okay. I will run some devices with and some without log and see what is happening. To make sure: I am using the PerfDebugging branch for testing from now on?

Accelerometer Sensitivity: I have raised the sensitivity threshold from 10 to 30. Which still works fine when i knock on the case with the PCB inside. With this new threshold it should not wake up any more when i walk through my house (wooden floor) :)

EDsteve commented 9 months ago

@LIFsCode @OOHehir

Quick update about my tests: Only recently i used the PerfDebugging branch for tests. So that needs more time for results. But i have done around 30 tests (long term) with these three firmwares during the last two weeks. d2dd26c8f8e37bfb0c3e9861ef266469df3d2101 f4a6a8e6c3d3ceb6e098aaf379cfd48cfb49d683 d69080d9ce6eb05b3a9967dc8f74919bccfcbef0

Last two weeks impressions:

RecON_DetectON - > Reboots happen less frequently. I still have some ELOCs running since >10 days
RecON_DetectOFF -> Reboots happen much more frequently. (Maybe because the serial has a much higher output with constantly writing I2SMEMSSampler: inference buffer overrun). None of the ELOCs made it longer than 3 days (>6 devices).
Most of the reboots happen after the 60 sec time out from BT.
What might also fuel the reboots is that the knock sensor gets triggered every time the buzzer beeps. I have tried higher frequencies (800Hz and up) for the beeps which does not trigger the knock sensor any more.
Logging ON or OFF does not seem to make a big difference (but more tests are needed)

PerfDebugging branch

i definitely get much more reboots during RecON_DetectOFF with this branch. (even with logging OFF)
I can't say that this branch is running more stable, but more tests are needed

Accel_Trigger pin: Fabi seems to be on the right track :) I have measured weird signals (almost constantly at 3.3V) on the Accel_trigger pin. I observed that twice. And it goes high right after the 60 sec BT time out. How is the trigger pin connected to BT? I still have to do more test for a bigger picture. But i am sure you might be able to use that info already.

Accel Pin in Idle mode: Screenshot_20240205-165713_Pokit

After 60 second BT time out: Screenshot_20240205-140950_Pokit Screenshot_20240205-141154_Pokit

Some more logs with PerfDebugging branch: Guru WDT Reboot loop after 60sec time out.txt Assert failed - after 60 sec BT - log off.txt

LIFsCode commented 9 months ago

@EDsteve ok that's pretty good data thanks :) For the BT idle & Accel trigger pin measurement:

does idle mean no record no detection?
is the ELOC always rebooting or sometimes also just re-enabling Bluetooth?

I'm sure the issue is related to the power management, i.e. the pin has a SW configured pull down in the ESP, if this pin is turned off it might end up floating. If I2S sampling and BT is inactive the esp will enter light sleep mode automatically and this also affects the pin configs.

I will check that.

Anyway this does not explain the other reboots which appear during recording.

EDsteve commented 9 months ago

@LIFsCode

Idle meaning in "normal operation" waiting for an input.
BT reenabling itself without booting happened before. But that was caused by the buzzer. I can't recall any BT reenabling without the knock sensor being involved.

OOHehir commented 9 months ago

@EDsteve @LIFsCode Currently the 'buffer overrun' messages occur even if the AI inference & wav writing is not in progress. To cut down on the output I change it to only display when the respective task is active (i.e inference or wav writing in progress).

OOHehir commented 9 months ago

@EDsteve @LIFsCode I've made this change on both master & the PerDebugging branches. It should cut down on the logging requirement.

On another matter, while working on #74 I've noticed that the SD card is accessed from:

WAVFIleWriter (file recording)
EdgeImpulse (via call back in main.cpp) (results recording)
ElocSystem (mount state)
ElocCommands (capacity & free state)
main.cpp (mount, capacity, free space)

From the documentation it seems there is no thread safety imposed on direct SD card access, thread safety only occurs through an upper layer (e.g. filesystem). Could this possibly explain some of the issues?

I feel #74 might reduce some of the direct SD card access as it introduces methods to access (regularly updated) retained states.

LIFsCode commented 9 months ago

@OOHehir that's a good point, eventually the low level access for getting the size & capacity may cause problems I'll check that.

The mounted state isn't a problem, it's read only of a variable.

The writing of log msg, WAV file and CSV should be fine as they are handled through the file system layer.

OOHehir commented 9 months ago

@LIFsCode I should have mentioned also that before files are written, there's a check of available free space, which involves direct SD card access.

LIFsCode commented 9 months ago

Summary in between:

2 different causes lead to this crash (both relatec to BT activity)

The behavior of the Accel_trigger Pin (whould be kept low via ESP32 internal Pull-down) seems to get broken, once BT gets inactive. This is most likely because of the ELOC entering light sleep mode --> I check the power management on the IOs
The ELOC crashes with spinlock and or other unspecific causes: This I don't understand yet. Could be related to memory (stack)
- Some info material about this:
- Stack issue: https://stackoverflow.com/questions/73174790/esp-idf-assert-failed-spinlock-acquire
- SPI RAM related? https://github.com/espressif/esp-idf/issues/9120

Identification for the 2 causes:

Always shows "Interrupt wdt timeout on CPU0" (or CPU1). This is due to an interrupt storm caused by the Accel_trigger signal
Unclear, I have seen different panic messages and can't tell if they are separate issues or different appearances of the same issue

To be continued...

LIFsCode commented 9 months ago

@EDsteve could you please double check that the Accel_trigger signal behavior is as on the oscilloscope screenshots.

I just tried to reproduce it her and I don't see anything like that. Also the datasheet of the LIS3DH mentioned the INT1 pin to be a push-pull driver which means it should be actively driven low when no knock is detected.

Do you see this behavior on multiple HW boards? @OOHehir could you reproduce this crash, ELOC idle (no record, no sampling, no detection) wait till BT turns off --> WDT crash

I just want to make sure that it's not a HW fault, e.g. bad soldering. @EDsteve could you measure the resistance (Unpowered ELOC) between accel_trigger and GND & VCC

EDsteve commented 9 months ago

@LIFsCode I have an ELOC still trapped in the WDT reboot loop. Let me know if you have some test ideas :)

Here some results i found out today (This commit: 1bb5f54d11c521781202bb5c5db96093dc178cbb. No recording or detecting):

Accel Trigger Pin signal

Hardware: The tests yesterday were made with ELOC 3.4. The tests today show the same problem using ELOC 3.2.
Resistance measurements on five ELOCs (including the one from yesterdays test):

Resistance between Accel_trigger and GND: 700KOhm
Resistance between Accel_trigger and VCC: 700KOhm

Test results: After i have done a few tests i realized, that the accel pin stays high longer, only when i knock a few times on the ELOC right after (or during) BT timed out. Didn't realize that yesterday. This is the log from today for the reboot with knocking which is shown in the screenshot: WDT Reboot loop.txt

with knocking

If i don't knock on the ELOC. The accel pin is always low, but the ELOC still reboots with WDT.
without knocking

Sometimes it shows this:

I (26849) BluetoothServer: 20 seconds without bluetooth connection, exceeding max. of 20 sec! Shutting down Bluetooth.
W (26861) BT_APPL: bta_dm_disable BTA_DISABLE_DELAY set to 200 ms
I (29201) main: Battery: Voltage: 3.379V, 100% SoC, Temp 33 °C
I (29204) main: Heap: Min=106836, free=152936 (26%), largestFreeBlock=98304, fragmentation=36%
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x4009ba86  PS      : 0x00060c35  A0      : 0x8009ca64  A1      : 0x3ffbbfc0
A2      : 0x3f800878  A3      : 0x4009caa0  A4      : 0x3ffbc000  A5      : 0xffffffff
A6      : 0x3f800874  A7      : 0xfffffffc  A8      : 0x8009ba95  A9      : 0x3ffbbfa0
A10     : 0x3f80087c  A11     : 0xfffffffc  A12     : 0xffffffff  A13     : 0x3ffbc000
A14     : 0x007b661c  A15     : 0x003fffff  SAR     : 0x00000004  EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000  LBEG    : 0x40092a2a  LEND    : 0x40092a35  LCOUNT  : 0x00000000

Backtrace: 0x4009ba83:0x3ffbbfc0 0x4009ca61:0x3ffbbfe0 0x4011ceb3:0x3ffbc000 0x400e9af4:0x3ffbc040 0x401e1e0b:0x3ffbc1f0

  #0  0x4009ba83:0x3ffbbfc0 in block_size at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf_block_functions.h:74
      (inlined by) block_is_last at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf_block_functions.h:85
      (inlined by) tlsf_walk_pool at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf.c:641
  #1  0x4009ca61:0x3ffbbfe0 in multi_heap_get_info_impl at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/multi_heap.c:413
  #2  0x4011ceb3:0x3ffbc000 in brownout_hal_config at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\hal\esp32/brownout_hal.c:21
  #3  0x400e9af4:0x3ffbc040 in Battery::GetInstance() at src/Battery.hpp:83
      (inlined by) Battery::GetInstance() at src/Battery.hpp:82
      (inlined by) app_main at src/main.cpp:912
  #4  0x401e1e0b:0x3ffbc1f0 in esp_timer_impl_init at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_timer\src/esp_timer_impl_lac.c:246

Core  1 register dump:
PC      : 0x4009493e  PS      : 0x00060535  A0      : 0x800958de  A1      : 0x3ffbd070
A2      : 0x3ffcc460  A3      : 0x00000001  A4      : 0x00060520  A5      : 0x00060523
A6      : 0x00060520  A7      : 0x00000001  A8      : 0x00000001  A9      : 0x3ffbd060
A10     : 0x00000003  A11     : 0x00060523  A12     : 0x00060520  A13     : 0x00060123
A14     : 0x007bec18  A15     : 0x003fffff  SAR     : 0x0000001d  EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000  LBEG    : 0x40097111  LEND    : 0x40097124  LCOUNT  : 0x00000016

Backtrace: 0x4009493b:0x3ffbd070 0x400958db:0x3ffbd090 0x40097279:0x3ffbd0b0

  #0  0x4009493b:0x3ffbd070 in esp_pm_impl_waiti at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_pm/pm_impl.c:827
  #1  0x400958db:0x3ffbd090 in esp_vApplicationIdleHook at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_system/freertos_hooks.c:60
  #2  0x40097279:0x3ffbd0b0 in prvCheckTasksWaitingTermination at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\freertos/tasks.c:4247
      (inlined by) prvIdleTask at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\freertos/tasks.c:3944

ELF file SHA256: dc84717d1c4edc89

Rebooting...
ets Jul 29 2019 12:21:46

rst:0xc (SW_CPU_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:6660
load:0x40078000,len:14944
ho 0 tail 12 room 4
load:0x40080400,len:3816
entry 0x40080698
I (29) boot: ESP-IDF 4.4.4 2nd stage bootloader
I (29) boot: compile time 16:32:23
I (29) boot: chip revision: v3.0
I (32) boot_comm: chip revision: 3, min. bootloader chip revision: 0
I (39) boot.esp32: SPI Speed      : 40MHz
I (44) boot.esp32: SPI Mode       : DIO
I (48) boot.esp32: SPI Flash Size : 16MB
I (53) boot: Enabling RNG early entropy source...
I (58) boot: Partition Table:

And sometimes just this:

I (27827) BluetoothServer: 20 seconds without bluetooth connection, exceeding max. of 20 sec! Shutting down Bluetooth.
W (27839) BT_APPL: bta_dm_disable BTA_DISABLE_DELAY set to 200 ms
I (30203) main: Battery: Voltage: 3.379V, 100% SoC, Temp 33 °C
I (30206) main: Heap: Min=106836, free=152936 (26%), largestFreeBlock=98304, fragmentation=36%
ets Jul 29 2019 12:21:46

rst:0x8 (TG1WDT_SYS_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:6660
load:0x40078000,len:14944
ho 0 tail 12 room 4
load:0x40080400,len:3816
entry 0x40080698
I (58) boot: ESP-IDF 4.4.4 2nd stage bootloader
I (58) boot: compile time 16:32:23
I (58) boot: chip revision: v3.0
I (61) boot_comm: chip revision: 3, min. bootloader chip revision: 0
I (68) boot.esp32: SPI Speed      : 40MHz
I (73) boot.esp32: SPI Mode       : DIO
I (78) boot.esp32: SPI Flash Size : 16MB
W (82) boot.esp32: PRO CPU has been reset by WDT.
W (88) boot.esp32: WDT reset info: PRO CPU PC=0x4011ad7c
W (94) boot.esp32: WDT reset info: APP CPU PC=0x4011ad4f
I (100) boot: Enabling RNG early entropy source...
I (105) boot: Partition Table:

EDIT: added the commit

OOHehir commented 9 months ago

@EDsteve could you please double check that the Accel_trigger signal behavior is as on the oscilloscope screenshots.

I just tried to reproduce it her and I don't see anything like that. Also the datasheet of the LIS3DH mentioned the INT1 pin to be a push-pull driver which means it should be actively driven low when no knock is detected.

Do you see this behavior on multiple HW boards? @OOHehir could you reproduce this crash, ELOC idle (no record, no sampling, no detection) wait till BT turns off --> WDT crash

I just want to make sure that it's not a HW fault, e.g. bad soldering. @EDsteve could you measure the resistance (Unpowered ELOC) between accel_trigger and GND & VCC

@LIFsCode I've just tried on my setup the follow: Master branch, latest commit (ac2d9c6), matched the voltage I see on the log provided by @EDsteve (3.381V), ELOC idle (no record, no sampling, no detection) & so far haven't seen any WDT crash. I note when the BT is shutoff there's a voltage spike, up to around 300mA, graph below. I try a few more runs..

@EDsteve Which build of the project are you using? The EI or non-EI version?

BT_power_off

LIFsCode commented 9 months ago

@EDsteve, @OOHehir: I just have an idea which could be the cause: lis3dh config latch option to false

lis3dh_int_click_config_t lis3dh_click_config = {
    /* NOTE: if double interrupt is used single clicks will be ignored (must be detected manually)*/
    .x_single = false,   // x-axis single tap interrupt enabled
    .x_double = false,   // x-axis double tap interrupt enabled
    .y_single = false,   // y-axis single tap interrupt enabled
    .y_double = false,   // y-axis double tap interrupt enabled
    .z_single = false,   // z-axis single tap interrupt enabled
    .z_double = true,    // z-axis double tap interrupt enabled
    .threshold = 30,     /* threshold which is used by the system to start the click-detection procedure */
    .latch = false,       /* true:  the interrupt is kept high until CLICK_SRC (39h) is read
                          * false: the interrupt is kept high for the duration of the latency window.  */
    .time_limit   = 10,  /* define the maximum time interval that can elapse between the start of 
                          * the click-detection procedure (the acceleration on the selected channel exceeds the 
                          * programmed threshold) and when the acceleration falls back below the threshold. */
    .time_latency = 20,  /* define the time interval that starts after the first click detection where 
                          * the click-detection procedure is disabled, in cases where the device is configured for 
                          * double-click detection */
    .time_window  = 255, /* define the maximum interval of time that can elapse after the end of the 
                          * latency interval in which the click-detection procedure can start, in cases where the device 
                          * is configured for double-click detection. */
};

But that requires additional config for interrupt duration. I'll implement that this evening

EDsteve commented 9 months ago

@OOHehir I am using esp32dev-ei-windows

OOHehir commented 9 months ago

@EDsteve Thanks, I'll try a few more runs but can't reproduce so far..

EDsteve commented 9 months ago

@OOHehir @LIFsCode Also for me difficult to get WDT reboots to show up today. Since hours of trying only one Guru Reboot (While RecordingON_DetectingOFF) and now again a WDT reboot loop. It happened while i was fiddeling around with the ELOC to readjust the UART wires. Getting a few Brownouts though which i didn't see before. Is that only me?

@OOHehir The power consumption you measured is 50mA and up. Is that the power consumption of the ELOC? I measure much lower Amps. (For example: 6mA after BT shuts down) Nice graphs. Which Osci is that?

OOHehir commented 9 months ago

@EDsteve @LIFsCode I've been running the board at higher voltage, showing on terminal output as between 4v & 5v. When running at around 3.9v I observed one crash at startup but without a brownout annotation. I wonder could some of the crashes be power related?

@EDsteve It's from the Nordic power profiler

OOHehir commented 9 months ago

@LIFsCode @EDsteve The reason I think crash at BT shutdown could be power related is the current spike observed & also I see brownout_hal_config.c listed in one of the backtraces.

LIFsCode commented 9 months ago

@EDsteve @OOHehir i changed the [interrupt of the LIS3DH to not latch] (https://github.com/LIFsCode/ELOC-3.0/commit/5be2c6ad9d12792314d14bc60d0bdd001af29bf4). @EDsteve could you test with this commit to see if this changes the situation?

EDsteve commented 9 months ago

@LIFsCode Good find :) Your last commit removes the "stuck on high" problem of the accel_trigger pin. Looks perfect now 👍 :)

Time for more testing with the new commit.

@OOHehir Brownout: When the battery switch is set to LiFePo, the battery feeds directly into the 3.3V line without a voltage regulator (So be carefull using that switch). The only limiting factor i can see are the two switches (ON/OFF and Bat-Switch, the power must run through. They are both rated for 0.5A@12 and 1A@6V. Might not be enough?

EDsteve commented 9 months ago

Let's hope the last commit from @LIFsCode has solved the WDT reboots. We will find out the next days during more tests :)

Let's fight the other reboots. From now on I will only test with the PerfDebugging branch. In my experience i get much more reboots with recordON_detectOFF. So I mostly test in this mode. During my tests today i got two reboots with this error. Haven't seen tha one before so i post it:

Error: assert failed: block_merge_prev IDF\components\heap\heap_tlsf.c:344 (block_is_free(prev) && "prev block is not free though marked as such")
Commit: 104111ec78455a1b4b247d9c2b1ef64caeaf7362
Log: assert failed block_merge_prev - 104111ec - reconly.txt

Happens right after BT time out.

OOHehir commented 8 months ago

@EDsteve @LIFsCode I think I've seen something like this before. It happened when some service (e.g. WAVWriter or Inference) was being terminated & while this was underway its associated thread ran.

OOHehir commented 8 months ago

@EDsteve Thanks, I'll check the switch, perhaps I was too quick to jump to conclusions! Average current seems to be well below those switch ratings so don't think it's an issue.

LIFsCode commented 8 months ago

Let's hope the last commit from @LIFsCode has solved the WDT reboots. We will find out the next days during more tests :)

Let's fight the other reboots. From now on I will only test with the PerfDebugging branch. In my experience i get much more reboots with recordON_detectOFF. So I mostly test in this mode. During my tests today i got two reboots with this error. Haven't seen tha one before so i post it:
* Error: `assert failed: block_merge_prev IDF\components\heap\heap_tlsf.c:344 (block_is_free(prev) && "prev block is not free though marked as such")`

* Commit: [104111e](https://github.com/LIFsCode/ELOC-3.0/commit/104111ec78455a1b4b247d9c2b1ef64caeaf7362)

* Log: [assert failed block_merge_prev - 104111ec - reconly.txt](https://github.com/LIFsCode/ELOC-3.0/files/14211630/assert.failed.block_merge_prev.-.104111ec.-.reconly.txt)
Happens right after BT time out.

@EDsteve @OOHehir I'm pretty sure this is related to a heap or stack corruption. Heap and stack are adjacent to each other, so stack overflow could result in heap corruption. I just came across this topic, though I'm not sure if its related.

@EDsteve: now that the WDT issues are resolved, could you maybe rerun your tests with "logToSdCard" : false. Please note that this config change does only get in effect after a reboot. I just want to make sure that it is not due to the additional log file handling which also affects created tasks of the Bluetooth library, which may have a very limited task specific stack size. This would also keep out any influence on multiple tasks working on the SD card filesystem.

Actually if I track back the trace I end up with this function: Maybe the use of trace messages is causing stack problems here?

LIFsCode commented 8 months ago

@EDsteve , @OOHehir I added additional heap & stack debugging options and this gives me a little more informations, though I still have no idea about the root cause HeapPoisoning.txt

As seen in the log, part of the free unallocated sections in the heap got overwritten by 1 byte: CORRUPT HEAP: Bad tail at 0x3ffe1f98. Expected 0xbaad5678 got 0xbaad5602

This could indicate that one function (I cannot tell which one) is accessing invalid memory. Even though the assert occurs during WAVFileWriter::finish() doesn't meen that function is causing the error. The canary bytes are only checked durng allocation and freeing of heap memory.

@EDsteve could you use c9cc821 or even better 4bae8bb for future tests to get better info from the heap integrity. Thanks

EDIT: I managed to trigger the fault 3 times:

2 times identical: HeapPoisoning.txt
1 time: HeapPoisoning_BT.txt This evan has more than 1 Byte corrupted outside of the allocated memory

I increased heap debugging level to comprehensive with 4bae8bb and this gives me a bit more details:

HeapPoisoning_Comprehensive.txt

Interesting thinges to note:

BT never was switched off

I see the first heap integrity check fail right after the start of recording (BT still connected)

CORRUPT HEAP: Invalid data at 0x3ffe3640. Expected 0xfefefefe got 0xfefefe02
E (64251) main: Heap integrity check failed!

Even though the heap is corrupted, the program continues to run as the respective heap segments are not freed or reallocated. The crash occurs later.
It is the same '02' Byte which has already been seen with the crashs during WAVFileWriter::finish()

OOHehir commented 8 months ago

@LIFsCode @EDsteve Possibly related? Which leads here.

I'm using

framework-arduinoespressif32 @ 3.20007.0 (2.0.7)
framework-espidf @ 3.40404.0 (4.4.4)
tool-cmake @ 3.16.4
tool-esptoolpy @ 1.40500.0 (4.5.0)
tool-mkfatfs @ 2.0.1
tool-mklittlefs @ 1.203.210628 (2.3)
tool-mkspiffs @ 2.230.0 (2.30)
tool-ninja @ 1.7.1
toolchain-esp32ulp @ 1.23500.220830 (2.35.0)
toolchain-xtensa-esp32 @ 8.4.0+2021r2-patch5

& can't reproduce.

@LIFsCode do you have any steps to reproduce?

EDsteve commented 8 months ago

@OOHehir @LIFsCode Sorry for the late reply. Too busy lately. Will hopefully have some more time tomorrow or the day after.

OOHehir commented 8 months ago

@LIFsCode I notice in your log the line:

W (70953) BT_HCI: hcif disc complete: hdl 0x80, rsn 0x13

Whereas in my log its:

W (24452) BT_HCI: hci cmd send: disconnect: hdl 0x80, rsn:0x13
W (24544) BT_HCI: hcif disc complete: hdl 0x80, rsn 0x16

From here rsn 0x13 is Other End Terminated Connection: User Ended Connection Whereas rsn 0x16 is Connection Terminated By Local Host

How are you terminating the BT connection? I'm disconnecting from an Android phone.

LIFsCode commented 8 months ago

@OOHehir do you also run 4bae8bb with comprehensive heap check enabled without this message

CORRUPT HEAP: Invalid data at 0x3ffe3640. Expected 0xfefefefe got 0xfefefe02
E (64251) main: Heap integrity check failed!

All I require for this is to issue a setRecordMode#mode=recordOn_DetectOff command

I use a Windows PC with a serial BT terminal and only disconnect via terminal. So I guess this is different than android. But since the heap corruption occurs before the disconnect I guess it's not relevant.

OOHehir commented 8 months ago

@LIFsCode Yes, using that commit:

git log
commit 4bae8bbc068919e5f4511ad5128bdfbb191e9dad (HEAD -> PerfDebugging, origin/PerfDebugging)
Author: LIFsCode <109753539+LIFsCode@users.noreply.github.com>
Date:   Sat Feb 10 17:31:55 2024 +0100

    DO NOT USE for production: Set heap checking to comprehensive

    * Note this will add significant performance impact

    helps to debug #66

I'm using the esp32dev-linux task. I've tried that command, no issues I can see. Full log, excerpt:


        -----------------------------------
I (15765) BtCmds: setRecordMode requested recordON_detectOFF
I (15770) BtCmds: setRecordMode now recordOn_detectOff(2)
I (15774) BtCmds: wav_writer mode = continuous
I (15777) BtCmds: ai mode = OFF
I (16396) I2S: APLL expected frequency is 8192000 Hz, real frequency is 8191999 Hz
I (16406) I2S: DMA Malloc info, datalen=blocksize=4092, dma_buf_count=8
I (16718) SDC: SD card free space: 14.318619 GB
I (16725) SDC: SD card free space: 15014160 KiB
I (16736) main: Creating session folder /sdcard/eloc/not_set1707771267559
I (16827) main: Starting session with this config:
:```

OOHehir commented 8 months ago

@LIFsCode Just thinking about this now.. I think you probably changed something using menuconfig? I don't think these will transfer across as I already have a sdkconfig.esp32dev-ei* & its in gitignore.

OOHehir commented 8 months ago

@LIFsCode While using recordOn_detectOff I did get this crash:

I (938702) main: Battery: Voltage: 4.123V, 92% SoC, Temp 31 °C
I (938727) main: Heap: Min=26756, free=42316 (78%), largestFreeBlock=34816, fragmentation=18%
I (938948) main: PSRAM Heap: Min=4156475, free=4165187 (1%), largestFreeBlock=4128768, fragmentation=1%
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x40099881  PS      : 0x00060033  A0      : 0x80098ff5  A1      : 0x3fff5a10  
A2      : 0x3fff6e64  A3      : 0x3ffd0584  A4      : 0x3ffd0584  A5      : 0x00060023  
A6      : 0x00060020  A7      : 0x00000001  A8      : 0x00760063  A9      : 0x00640043  
A10     : 0x005d0086  A11     : 0x00000000  A12     : 0x00000000  A13     : 0x3ffeca5c  
A14     : 0x007bec18  A15     : 0x003fffff  SAR     : 0x0000001f  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00760067  LBEG    : 0x400928e0  LEND    : 0x400928fc  LCOUNT  : 0x00000000  

Backtrace: 0x4009987e:0x3fff5a10 0x40098ff2:0x3fff5a30 0x400f7b49:0x3fff5a50 0x400f7cef:0x3fff5aa0

  #0  0x4009987e:0x3fff5a10 in uxListRemove at /home/owen/.platformio/packages/framework-espidf@3.40404.0/components/freertos/list.c:192
  #1  0x40098ff2:0x3fff5a30 in xTaskGenericNotify at /home/owen/.platformio/packages/framework-espidf@3.40404.0/components/freertos/tasks.c:5609
  #2  0x400f7b49:0x3fff5a50 in I2SMEMSSampler::read() at lib/audio_input/src/I2SMEMSSampler.cpp:297
  #3  0x400f7cef:0x3fff5aa0 in I2SMEMSSampler::start_read_thread() at lib/audio_input/src/I2SMEMSSampler.cpp:403
      (inlined by) I2SMEMSSampler::start_read_thread_wrapper(void*) at lib/audio_input/src/I2SMEMSSampler.cpp:414

Which ends up at

                    inference->buf_ready = 1;
                    if (ei_TaskHandler != NULL)
                        xTaskNotify(ei_TaskHandler, (0), eNoAction);
                    }

ei_TaskHandler is created as:

TaskHandle_t ei_TaskHandler = nullptr;      // Task handler from I2S to AI inference

& this is only used once the ei_thread is created:

int ret = xTaskCreatePinnedToCore(this->start_ei_thread_wrapper, "ei_thread", 1024 * 4, this, TASK_PRIO_AI, &ei_TaskHandler, TASK_AI_CORE);

Could the crash sequence be:

System startup
ei_TaskHandler initially NULL
I2S starts sampling, initially ignores ei_TaskHandler as its NULL
EI task started, ei_TaskHandler becomes not NULL
I2S continues sampling, now writing to ei_TaskHandler as not NULL
EI task stopped, but ei_TaskHandler never set to NULL, points to 'somewhere'
I2S continues sampling, now writing to 'somewhere' & eventually crashes the system

Does this seem possible?

LIFsCode commented 8 months ago

@LIFsCode Just thinking about this now.. I think you probably changed something using menuconfig? I don't think these will transfer across as I already have a sdkconfig.esp32dev-ei* & its in gitignore.

@OOHehir Yes I set the CONFIG_HEAP_POISIONING_COMPREHENSIVE=y option. But I changed it in the sdkconfig.defaults so it should used for your build as well. If not we have to rethink the sdkconfig Do you maybe have to do a clean build to update this?

OOHehir commented 8 months ago

@LIFsCode because I already had a valid sdkconfig for the task it doesn't regenerate it from sdkconfig.defaults. I deleted the applicable sdkconfig & tested, can't reproduce so far. terminal_output_heap_config.txt

LIFsCode commented 8 months ago

@OOHehir @EDsteve Update on this. I did all these tests with a new commit (93ff97b) for debugging. I created a new branch for it, @OOHehir it realized too late you pushed to perfDebugging, and I don't wont to mess up the addresses since they might be important for debugging. We can clean up the various branches once we found all errors.

I added a lot of heap traces and heap integrity checks however I couldn't get the monitor to decode the addresses of the callers. Only the backtrace when the crash occured is decoded. the log files below contain theses traces. Additionally I add the option to output a .map file for debugging. Uploaded my version here: firmware.map.zip Don't get confused by the name This can be viewed with AMAP

My setup

 - framework-arduinoespressif32 @ 3.20007.0 (2.0.7)
 - framework-espidf @ 3.40404.0 (4.4.4)
 - tool-cmake @ 3.16.4
 - tool-esptoolpy @ 1.40500.0 (4.5.0)
 - tool-idf @ 1.0.1
 - tool-mconf @ 1.4060000.20190628 (406.0.0)
 - tool-ninja @ 1.9.0
 - toolchain-esp32ulp @ 1.23500.220830 (2.35.0)
 - toolchain-xtensa-esp32 @ 8.4.0+2021r2-patch5

I use windows serial bluetooth terminal in most cases ,but I did a test with Android BT terminal as well
2 GB SD card (slow class don't know what)
esp32dev-ei target
Config: config.txt

Steps to reproduce

Clean start
Connect via BT
Issue command setRecordMode#mode=recordOn_detectOff
Wait for heap to get corrupted (usually within the first bunch of samples written to sd card, I have a 100% error rate for reproducing)

Results

I have various traces with the above steps and they fail at different places with heap integrity checks but always with the same invalid data: CORRUPT HEAP: Invalid data at 0x3ffe3260. Expected 0xfefefefe got 0xfefefe02 Note its always 02
I also did tests with an Android terminal HeapTrace_Android.txt
I did tests with BT connected, but starting the the recording via push button. This does not cause any heap corruption: StartRecordViaButton_BTconnected.txt
I started the test (BT connected) via push button, stopped it, and then started it again via BT command --> crash: StartRecordViaButton_BTconnected_StartedViaBT.txt
I started Rec & Detect via BT --> stop recording via button --> stop rec & detect via BT --> crash. This sceanario is new and interestingly enough it fails in the same way. Except that it does not fail when detection & recording are running at the same time. This sceario also produces a lot of other errors, I assume it's a sideeffect of the heap itegrity checks and the heap tracing.StopRecordViaBT.txt

Interpretation

It is clearly an invalid memory access which causes the heap poisioning to get corrupted. Either by a task stack overflow (task stacks are also on the heap) or by an invalid memory access
Since the crash is not triggered when started via push button, even if a BT command (e.g. getStatus) is issued later, this is a strong hint that it has something to do with the BT command handling
No difference between Android nor Windows terminal
The corrupted data is always identical
The code segment where the heap corruption is detected varies --> this may indicate that the heap is not corrupted within main context but rather within another thread context.
The AI & Recording run error free at the same time (see results 5)
Stopping may also cause this error if AI and recording are running before (result 5)

This gives us the following 2 conditions which are required:

Recording only (no AI running)
Start recording command via BT

OOHehir commented 8 months ago

@LIFsCode Some nice investigation work. Based on it I think we may be a bit further in find the bug. Selectively commenting out this line triggers/ prevents the crash.

@EDsteve @LIFsCode I have to put my hand up & say that this looks like its my error, apologies. When I carried out this commit I didn't correct the enum scope. As it was hidden by #include guards it wasn't flagged in my IDE as an error but it still successfully compiled, even when the edge impulse feature was enabled. I've temporarily added

#define EDGE_IMPULSE_ENABLED

in project_config.h to improve visibility.

The PerfDebugging branch is still crashing when this command is issued

setRecordMode#mode=recordOn_detectOff

This is quite a change on my side, in previous builds this didn't occur, even before the correction above. I'm not sure what the change is that's causing this, but I'm getting a lot of new errors, perhaps due to the extra memory & overhead with the heap tracing.

Once again apologies for the considerable time this bug has caused.

LIFsCode / ELOC-3.0