Closed: EDsteve closed this issue 8 months ago.
@EDsteve @LIFsCode Perhaps this is a memory exhaustion issue. I made some changes as discussed; they may solve this issue as well.
@LIFsCode @OOHehir
I am adding more reboot logs here, which will hopefully give more insight into the problem.
@EDsteve Thanks ED. If you can use the method to trace where in the code the crashes occur it would really help. You might remember we did this before, here's a link to the documentation.
@OOHehir Okay, I see what is going on now. The VS Code serial monitor (which I have been using lately) doesn't decode backtraces to source files, but the PlatformIO monitor does.
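For reference, assuming a standard platform-espressif32 setup, the decoding is done by a monitor filter that can be enabled in platformio.ini:

```ini
; platformio.ini (sketch): decode backtrace addresses to file/line in the monitor
monitor_speed = 115200
monitor_filters = esp32_exception_decoder
```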
@EDsteve @OOHehir this looks like some heap or heap fragmentation problem or a stack overflow which then overwrites the subsequent heap sections.
I'll add some additional debugging info and try to reproduce it at home.
@LIFsCode Forgot to mention that this only happens occasionally. I haven't really seen a pattern, but I suspect it mostly happens with the first BT timeout after a reboot. Not sure if that makes sense.
@OOHehir @LIFsCode
I caught another Guru error with backtrace, this time in a loop:
EDIT: The buzzer wakes up the accelerometer every time it buzzes. So when the 60 sec timeout happens, the buzzer wakes up the knock sensor at the same time, which also wakes up BT again. Only on this specific PCB does the knock sensor wake up from the buzzer. But maybe that is the reason for the reboot loop? Will play around with the accelerometer sensitivity to find a better value.
@EDsteve @LIFsCode I just had a quick look at this log & perhaps it's a memory issue as well? There seems to be a lot occurring simultaneously: file rotation, the ArduinoJSON memory pool, file writing & a semaphore queue being established. That's just an educated guess from a first reading though..
@LIFsCode @OOHehir
Another Guru reboot. Do more log files help you to figure out where the problem comes from? Or should I stop uploading them?
From another device I am getting BT errors, but no reboot and it's still working normally. Not sure if that is connected to the reboots, but I will post it anyway in case it helps:
E (133702274) BT_OSI: config_save, err_code: 0x4
Guru Reboot after interacting with BT
How it happens: ELOC was in recOn_detectON mode for >24 hours, and after one of my random BT-test-connections it rebooted.
Commit: d69080d9ce6eb05b3a9967dc8f74919bccfcbef0
Config: stamptest1706795112761.txt
Log (With backtrace): Guru.txt
This seems a bit different:
Core 0 panic'ed (Interrupt wdt timeout on CPU0)
BT Error in log
* How it happened: ELOC was in recOn_detectON mode for >24 hours, and after one of my random BT-test-connections this error now appears every time BT turns on or off. But the ELOC app can connect without problems.
* Commit: [d69080d](https://github.com/LIFsCode/ELOC-3.0/commit/d69080d9ce6eb05b3a9967dc8f74919bccfcbef0)
* Here the log: [BT-Error.txt](https://github.com/LIFsCode/ELOC-3.0/files/14143010/BT-Error.txt)
I don't see this error code in the documentation, but perhaps it's a wrapper for this?
Ok, this is a different topic: Core 0 is stuck in an ISR (Interrupt Service Routine).
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x4000bfe7 PS : 0x00060835 A0 : 0x80085305 A1 : 0x3ffbe5d0
A2 : 0x00000001 A3 : 0x00060823 A4 : 0xfffffff0 A5 : 0x00060023
A6 : 0x00000000 A7 : 0x3ffdf580 A8 : 0x00000008 A9 : 0x3ffbe4e0
A10 : 0x00000004 A11 : 0x00060024 A12 : 0x00060023 A13 : 0x00060023
A14 : 0x00060023 A15 : 0x00000001 SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x400928d8 LEND : 0x400928f4 LCOUNT : 0xffffffff
Core 0 was running in ISR context:
EPC1 : 0x401779a3 EPC2 : 0x400869c6 EPC3 : 0x400869b1 EPC4 : 0x400869b1
The only thing I could imagine causing this is the knock sensor, which has an interrupt pin to wake the ELOC. @EDsteve are you knocking on the ELOC? I think maybe the knock sensor is too sensitive. The interrupt is edge triggered, so I assume the pin is toggling like crazy, but I haven't seen it here, so I can't tell for sure.
@EDsteve: if that happens again, could you please measure the "Accel_Trigger" signal on the PCB
Ok, I found something else, which might give another explanation to the WDT timeout: https://esp32.com/viewtopic.php?t=33221
Unfortunately I could reproduce the reboot only a single time: WDStuck.txt
This might be related to the issue described above, though I'm not sure. I will move BT to Core 1 and have to ask @EDsteve to rerun the tests. Sorry, it's frustrating, but I don't have any better idea.
@EDsteve : Do you also have these issues when the SD card logging is disabled? Since this adds a lot of file system handling even to integrated functions (as bluetooth, etc.) this might cause troubles.
I split optimizing the SD card logs into another issue. This will be a bit more work; I'm not sure if I'll manage it before the next deadline. If disabling the SD card log solves these issues, that would at least be a workaround.
@LIFsCode No worries. I don't mind firmware tests :) Okay, I will run some devices with and some without logging and see what happens. Just to make sure: should I use the PerfDebugging branch for testing from now on?
Accelerometer Sensitivity: I have raised the sensitivity threshold from 10 to 30, which still works fine when I knock on the case with the PCB inside. With this new threshold it should not wake up any more when I walk through my house (wooden floor) :)
@LIFsCode @OOHehir
Quick update about my tests: Only recently did I start using the PerfDebugging branch for tests, so that needs more time for results. But I have done around 30 (long-term) tests with these three firmwares during the last two weeks: d2dd26c8f8e37bfb0c3e9861ef266469df3d2101, f4a6a8e6c3d3ceb6e098aaf379cfd48cfb49d683, d69080d9ce6eb05b3a9967dc8f74919bccfcbef0
Last two weeks impressions:
RecON_DetectON -> Reboots happen less frequently. I still have some ELOCs running for >10 days
RecON_DetectOFF -> Reboots happen much more frequently (maybe because the serial output is much higher, constantly writing `I2SMEMSSampler: inference buffer overrun`). None of the ELOCs made it longer than 3 days (>6 devices).
Most of the reboots happen after the 60 sec BT timeout.
What might also fuel the reboots is that the knock sensor gets triggered every time the buzzer beeps. I have tried higher frequencies (800 Hz and up) for the beeps, which no longer trigger the knock sensor.
Logging ON or OFF does not seem to make a big difference (but more tests are needed)
PerfDebugging branch
Accel_Trigger pin: Fabi seems to be on the right track :) I have measured weird signals (almost constantly at 3.3V) on the Accel_trigger pin. I observed that twice, and it goes high right after the 60 sec BT timeout. How is the trigger pin connected to BT? I still have to do more tests for a bigger picture, but I'm sure you can already use that info.
Accel Pin in Idle mode:
After 60 second BT time out:
Some more logs with the PerfDebugging branch:
- Guru WDT Reboot loop after 60sec time out.txt
- Assert failed - after 60 sec BT - log off.txt
@EDsteve ok that's pretty good data thanks :) For the BT idle & Accel trigger pin measurement:
I'm sure the issue is related to power management, i.e. the pin has a SW-configured pull-down in the ESP; if this pull-down is turned off, the pin might end up floating. If I2S sampling and BT are inactive, the ESP enters light sleep mode automatically, and this also affects the pin configs.
I will check that.
Anyway this does not explain the other reboots which appear during recording.
@LIFsCode
@EDsteve @LIFsCode Currently the 'buffer overrun' messages occur even if AI inference & wav writing are not in progress. To cut down on the output I changed it to only display when the respective task is active (i.e. inference or wav writing in progress).
@EDsteve @LIFsCode I've made this change on both the master & PerfDebugging branches. It should cut down on the logging requirement.
On another matter, while working on #74 I've noticed that the SD card is accessed from:
From the documentation it seems there is no thread safety imposed on direct SD card access; thread safety only occurs through an upper layer (e.g. the filesystem). Could this possibly explain some of the issues?
I feel #74 might reduce some of the direct SD card access as it introduces methods to access (regularly updated) retained states.
@OOHehir that's a good point; the low-level access for getting the size & capacity may eventually cause problems. I'll check that.
The mounted state isn't a problem, it's read only of a variable.
The writing of log msg, WAV file and CSV should be fine as they are handled through the file system layer.
@LIFsCode I should have mentioned also that before files are written, there's a check of available free space, which involves direct SD card access.
2 different causes lead to this crash (both related to BT activity)
Identification for the 2 causes:
To be continued...
@EDsteve could you please double check that the Accel_trigger signal behavior is as on the oscilloscope screenshots.
I just tried to reproduce it here and I don't see anything like that. Also, the datasheet of the LIS3DH says the INT1 pin is a push-pull driver, which means it should be actively driven low when no knock is detected.
Do you see this behavior on multiple HW boards? @OOHehir could you try to reproduce this crash: ELOC idle (no record, no sampling, no detection), wait till BT turns off --> WDT crash.
I just want to make sure that it's not a HW fault, e.g. bad soldering. @EDsteve could you measure the resistance (Unpowered ELOC) between accel_trigger and GND & VCC
@LIFsCode I have an ELOC still trapped in the WDT reboot loop. Let me know if you have some test ideas :)
Here are some results I found today (commit 1bb5f54d11c521781202bb5c5db96093dc178cbb, no recording or detecting):
Hardware: The tests yesterday were made with ELOC 3.4. The tests today show the same problem using ELOC 3.2.
Resistance measurements on five ELOCs (including the one from yesterday's test):
Test results: After doing a few tests I realized that the accel pin only stays high longer when I knock a few times on the ELOC right after (or during) the BT timeout. I didn't realize that yesterday. This is today's log for the reboot-with-knocking shown in the screenshot: WDT Reboot loop.txt
If I don't knock on the ELOC, the accel pin is always low, but the ELOC still reboots with WDT.
Sometimes it shows this:
I (26849) BluetoothServer: 20 seconds without bluetooth connection, exceeding max. of 20 sec! Shutting down Bluetooth.
W (26861) BT_APPL: bta_dm_disable BTA_DISABLE_DELAY set to 200 ms
I (29201) main: Battery: Voltage: 3.379V, 100% SoC, Temp 33 °C
I (29204) main: Heap: Min=106836, free=152936 (26%), largestFreeBlock=98304, fragmentation=36%
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x4009ba86 PS : 0x00060c35 A0 : 0x8009ca64 A1 : 0x3ffbbfc0
A2 : 0x3f800878 A3 : 0x4009caa0 A4 : 0x3ffbc000 A5 : 0xffffffff
A6 : 0x3f800874 A7 : 0xfffffffc A8 : 0x8009ba95 A9 : 0x3ffbbfa0
A10 : 0x3f80087c A11 : 0xfffffffc A12 : 0xffffffff A13 : 0x3ffbc000
A14 : 0x007b661c A15 : 0x003fffff SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40092a2a LEND : 0x40092a35 LCOUNT : 0x00000000
Backtrace: 0x4009ba83:0x3ffbbfc0 0x4009ca61:0x3ffbbfe0 0x4011ceb3:0x3ffbc000 0x400e9af4:0x3ffbc040 0x401e1e0b:0x3ffbc1f0
#0 0x4009ba83:0x3ffbbfc0 in block_size at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf_block_functions.h:74
(inlined by) block_is_last at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf_block_functions.h:85
(inlined by) tlsf_walk_pool at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/heap_tlsf.c:641
#1 0x4009ca61:0x3ffbbfe0 in multi_heap_get_info_impl at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\heap/multi_heap.c:413
#2 0x4011ceb3:0x3ffbc000 in brownout_hal_config at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\hal\esp32/brownout_hal.c:21
#3 0x400e9af4:0x3ffbc040 in Battery::GetInstance() at src/Battery.hpp:83
(inlined by) Battery::GetInstance() at src/Battery.hpp:82
(inlined by) app_main at src/main.cpp:912
#4 0x401e1e0b:0x3ffbc1f0 in esp_timer_impl_init at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_timer\src/esp_timer_impl_lac.c:246
Core 1 register dump:
PC : 0x4009493e PS : 0x00060535 A0 : 0x800958de A1 : 0x3ffbd070
A2 : 0x3ffcc460 A3 : 0x00000001 A4 : 0x00060520 A5 : 0x00060523
A6 : 0x00060520 A7 : 0x00000001 A8 : 0x00000001 A9 : 0x3ffbd060
A10 : 0x00000003 A11 : 0x00060523 A12 : 0x00060520 A13 : 0x00060123
A14 : 0x007bec18 A15 : 0x003fffff SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40097111 LEND : 0x40097124 LCOUNT : 0x00000016
Backtrace: 0x4009493b:0x3ffbd070 0x400958db:0x3ffbd090 0x40097279:0x3ffbd0b0
#0 0x4009493b:0x3ffbd070 in esp_pm_impl_waiti at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_pm/pm_impl.c:827
#1 0x400958db:0x3ffbd090 in esp_vApplicationIdleHook at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\esp_system/freertos_hooks.c:60
#2 0x40097279:0x3ffbd0b0 in prvCheckTasksWaitingTermination at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\freertos/tasks.c:4247
(inlined by) prvIdleTask at C:\Users\edste\.platformio\packages\framework-espidf@3.40404.0\components\freertos/tasks.c:3944
ELF file SHA256: dc84717d1c4edc89
Rebooting...
ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:6660
load:0x40078000,len:14944
ho 0 tail 12 room 4
load:0x40080400,len:3816
entry 0x40080698
I (29) boot: ESP-IDF 4.4.4 2nd stage bootloader
I (29) boot: compile time 16:32:23
I (29) boot: chip revision: v3.0
I (32) boot_comm: chip revision: 3, min. bootloader chip revision: 0
I (39) boot.esp32: SPI Speed : 40MHz
I (44) boot.esp32: SPI Mode : DIO
I (48) boot.esp32: SPI Flash Size : 16MB
I (53) boot: Enabling RNG early entropy source...
I (58) boot: Partition Table:
And sometimes just this:
I (27827) BluetoothServer: 20 seconds without bluetooth connection, exceeding max. of 20 sec! Shutting down Bluetooth.
W (27839) BT_APPL: bta_dm_disable BTA_DISABLE_DELAY set to 200 ms
I (30203) main: Battery: Voltage: 3.379V, 100% SoC, Temp 33 °C
I (30206) main: Heap: Min=106836, free=152936 (26%), largestFreeBlock=98304, fragmentation=36%
ets Jul 29 2019 12:21:46
rst:0x8 (TG1WDT_SYS_RESET),boot:0x3b (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:6660
load:0x40078000,len:14944
ho 0 tail 12 room 4
load:0x40080400,len:3816
entry 0x40080698
I (58) boot: ESP-IDF 4.4.4 2nd stage bootloader
I (58) boot: compile time 16:32:23
I (58) boot: chip revision: v3.0
I (61) boot_comm: chip revision: 3, min. bootloader chip revision: 0
I (68) boot.esp32: SPI Speed : 40MHz
I (73) boot.esp32: SPI Mode : DIO
I (78) boot.esp32: SPI Flash Size : 16MB
W (82) boot.esp32: PRO CPU has been reset by WDT.
W (88) boot.esp32: WDT reset info: PRO CPU PC=0x4011ad7c
W (94) boot.esp32: WDT reset info: APP CPU PC=0x4011ad4f
I (100) boot: Enabling RNG early entropy source...
I (105) boot: Partition Table:
EDIT: added the commit
@LIFsCode I've just tried the following on my setup: master branch, latest commit (ac2d9c6), matched the voltage I see in the log provided by @EDsteve (3.381V), ELOC idle (no record, no sampling, no detection) & so far haven't seen any WDT crash. I note that when BT is shut off there's a current spike, up to around 300mA, graph below. I'll try a few more runs..
@EDsteve Which build of the project are you using? The EI or non-EI version?
@EDsteve, @OOHehir: I just had an idea what the cause could be: the lis3dh config latch option set to false
lis3dh_int_click_config_t lis3dh_click_config = {
/* NOTE: if double interrupt is used single clicks will be ignored (must be detected manually)*/
.x_single = false, // x-axis single tap interrupt enabled
.x_double = false, // x-axis double tap interrupt enabled
.y_single = false, // y-axis single tap interrupt enabled
.y_double = false, // y-axis double tap interrupt enabled
.z_single = false, // z-axis single tap interrupt enabled
.z_double = true, // z-axis double tap interrupt enabled
.threshold = 30, /* threshold which is used by the system to start the click-detection procedure */
.latch = false, /* true: the interrupt is kept high until CLICK_SRC (39h) is read
* false: the interrupt is kept high for the duration of the latency window. */
.time_limit = 10, /* define the maximum time interval that can elapse between the start of
* the click-detection procedure (the acceleration on the selected channel exceeds the
* programmed threshold) and when the acceleration falls back below the threshold. */
.time_latency = 20, /* define the time interval that starts after the first click detection where
* the click-detection procedure is disabled, in cases where the device is configured for
* double-click detection */
.time_window = 255, /* define the maximum interval of time that can elapse after the end of the
* latency interval in which the click-detection procedure can start, in cases where the device
* is configured for double-click detection. */
};
But that requires additional config for the interrupt duration. I'll implement that this evening.
@OOHehir I am using esp32dev-ei-windows
@EDsteve Thanks, I'll try a few more runs but can't reproduce so far..
@OOHehir @LIFsCode It's also difficult for me to get WDT reboots to show up today. After hours of trying, only one Guru reboot (while RecordingON_DetectingOFF) and now again a WDT reboot loop. It happened while I was fiddling around with the ELOC to readjust the UART wires. I'm getting a few brownouts though, which I didn't see before. Is that only me?
@OOHehir The current you measured is 50mA and up. Is that the consumption of the ELOC? I measure much lower currents (for example: 6mA after BT shuts down). Nice graphs. Which oscilloscope is that?
@EDsteve @LIFsCode I've been running the board at a higher voltage, showing on terminal output as between 4V & 5V. When running at around 3.9V I observed one crash at startup, but without a brownout annotation. I wonder, could some of the crashes be power related?
@EDsteve It's from the Nordic power profiler
@LIFsCode @EDsteve The reason I think the crash at BT shutdown could be power related is the observed current spike, & also I see brownout_hal_config.c listed in one of the backtraces.
@EDsteve @OOHehir I changed the [interrupt of the LIS3DH to not latch](https://github.com/LIFsCode/ELOC-3.0/commit/5be2c6ad9d12792314d14bc60d0bdd001af29bf4). @EDsteve could you test with this commit to see if it changes the situation?
@LIFsCode Good find :) Your last commit removes the "stuck on high" problem of the accel_trigger pin. Looks perfect now 👍 :)
Time for more testing with the new commit.
@OOHehir Brownout: When the battery switch is set to LiFePo, the battery feeds directly into the 3.3V line without a voltage regulator (so be careful using that switch). The only limiting factor I can see are the two switches (ON/OFF and Bat-Switch) the power must run through. They are both rated for 0.5A@12V and 1A@6V. Might not be enough?
Let's hope the last commit from @LIFsCode has solved the WDT reboots. We will find out the next days during more tests :)
Let's fight the other reboots. From now on I will only test with the PerfDebugging branch. In my experience I get many more reboots with recordON_detectOFF, so I mostly test in this mode. During my tests today I got two reboots with this error. I haven't seen that one before, so I'll post it:
Error: assert failed: block_merge_prev IDF\components\heap\heap_tlsf.c:344 (block_is_free(prev) && "prev block is not free though marked as such")
Commit: 104111ec78455a1b4b247d9c2b1ef64caeaf7362
Log: assert failed block_merge_prev - 104111ec - reconly.txt
Happens right after BT time out.
@EDsteve @LIFsCode I think I've seen something like this before. It happened when some service (e.g. WAVWriter or Inference) was being terminated & while this was underway its associated thread ran.
@EDsteve Thanks, I'll check the switch, perhaps I was too quick to jump to conclusions! Average current seems to be well below those switch ratings so don't think it's an issue.
@EDsteve @OOHehir I'm pretty sure this is related to heap or stack corruption. Heap and stack are adjacent to each other, so a stack overflow could result in heap corruption. I just came across this topic, though I'm not sure if it's related.
@EDsteve: now that the WDT issues are resolved, could you maybe rerun your tests with "logToSdCard" : false. Please note that this config change only takes effect after a reboot.
I just want to make sure that it is not due to the additional log file handling, which also affects tasks created by the Bluetooth library; those may have a very limited task-specific stack size.
This would also rule out any influence of multiple tasks working on the SD card filesystem.
Actually, if I track the trace back, I end up with this function. Maybe the use of trace messages is causing stack problems here?
@EDsteve, @OOHehir I added additional heap & stack debugging options and this gives me a little more information, though I still have no idea about the root cause: HeapPoisoning.txt
As seen in the log, one byte of a free (unallocated) section of the heap got overwritten:
CORRUPT HEAP: Bad tail at 0x3ffe1f98. Expected 0xbaad5678 got 0xbaad5602
This could indicate that some function (I cannot tell which one) is accessing invalid memory. Even though the assert occurs during WAVFileWriter::finish(), that doesn't mean this function is causing the error. The canary bytes are only checked during allocation and freeing of heap memory.
@EDsteve could you use c9cc821 or even better 4bae8bb for future tests to get better info from the heap integrity. Thanks
EDIT: I managed to trigger the fault 3 times:
I increased the heap debugging level to comprehensive with 4bae8bb and this gives me a bit more detail:
HeapPoisoning_Comprehensive.txt
Interesting things to note:
CORRUPT HEAP: Invalid data at 0x3ffe3640. Expected 0xfefefefe got 0xfefefe02
E (64251) main: Heap integrity check failed!
@LIFsCode @EDsteve Possibly related? Which leads here.
I'm using `toolchain-xtensa-esp32 @ 8.4.0+2021r2-patch5` & can't reproduce.
@LIFsCode do you have any steps to reproduce?
@OOHehir @LIFsCode Sorry for the late reply. Too busy lately. Will hopefully have some more time tomorrow or the day after.
@LIFsCode I notice in your log the line:
W (70953) BT_HCI: hcif disc complete: hdl 0x80, rsn 0x13
Whereas in my log its:
W (24452) BT_HCI: hci cmd send: disconnect: hdl 0x80, rsn:0x13
W (24544) BT_HCI: hcif disc complete: hdl 0x80, rsn 0x16
From here, rsn 0x13 is "Other End Terminated Connection: User Ended Connection", whereas rsn 0x16 is "Connection Terminated By Local Host".
How are you terminating the BT connection? I'm disconnecting from an Android phone.
@OOHehir do you also run 4bae8bb with the comprehensive heap check enabled, without these messages?
CORRUPT HEAP: Invalid data at 0x3ffe3640. Expected 0xfefefefe got 0xfefefe02
E (64251) main: Heap integrity check failed!
All I require for this is to issue a setRecordMode#mode=recordOn_DetectOff command
I use a Windows PC with a serial BT terminal and only disconnect via the terminal, so I guess this is different from Android. But since the heap corruption occurs before the disconnect, I guess it's not relevant.
@LIFsCode Yes, using that commit:
git log
commit 4bae8bbc068919e5f4511ad5128bdfbb191e9dad (HEAD -> PerfDebugging, origin/PerfDebugging)
Author: LIFsCode <109753539+LIFsCode@users.noreply.github.com>
Date: Sat Feb 10 17:31:55 2024 +0100
DO NOT USE for production: Set heap checking to comprehensive
* Note this will add significant performance impact
helps to debug #66
I'm using the esp32dev-linux task. I've tried that command; no issues I can see. Full log, excerpt:
-----------------------------------
I (15765) BtCmds: setRecordMode requested recordON_detectOFF
I (15770) BtCmds: setRecordMode now recordOn_detectOff(2)
I (15774) BtCmds: wav_writer mode = continuous
I (15777) BtCmds: ai mode = OFF
I (16396) I2S: APLL expected frequency is 8192000 Hz, real frequency is 8191999 Hz
I (16406) I2S: DMA Malloc info, datalen=blocksize=4092, dma_buf_count=8
I (16718) SDC: SD card free space: 14.318619 GB
I (16725) SDC: SD card free space: 15014160 KiB
I (16736) main: Creating session folder /sdcard/eloc/not_set1707771267559
I (16827) main: Starting session with this config:
@LIFsCode Just thinking about this now.. I think you probably changed something using menuconfig? I don't think those changes will transfer across, as I already have a sdkconfig.esp32dev-ei* & it's in gitignore.
@LIFsCode While using recordOn_detectOff I did get this crash:
I (938702) main: Battery: Voltage: 4.123V, 92% SoC, Temp 31 °C
I (938727) main: Heap: Min=26756, free=42316 (78%), largestFreeBlock=34816, fragmentation=18%
I (938948) main: PSRAM Heap: Min=4156475, free=4165187 (1%), largestFreeBlock=4128768, fragmentation=1%
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x40099881 PS : 0x00060033 A0 : 0x80098ff5 A1 : 0x3fff5a10
A2 : 0x3fff6e64 A3 : 0x3ffd0584 A4 : 0x3ffd0584 A5 : 0x00060023
A6 : 0x00060020 A7 : 0x00000001 A8 : 0x00760063 A9 : 0x00640043
A10 : 0x005d0086 A11 : 0x00000000 A12 : 0x00000000 A13 : 0x3ffeca5c
A14 : 0x007bec18 A15 : 0x003fffff SAR : 0x0000001f EXCCAUSE: 0x0000001c
EXCVADDR: 0x00760067 LBEG : 0x400928e0 LEND : 0x400928fc LCOUNT : 0x00000000
Backtrace: 0x4009987e:0x3fff5a10 0x40098ff2:0x3fff5a30 0x400f7b49:0x3fff5a50 0x400f7cef:0x3fff5aa0
#0 0x4009987e:0x3fff5a10 in uxListRemove at /home/owen/.platformio/packages/framework-espidf@3.40404.0/components/freertos/list.c:192
#1 0x40098ff2:0x3fff5a30 in xTaskGenericNotify at /home/owen/.platformio/packages/framework-espidf@3.40404.0/components/freertos/tasks.c:5609
#2 0x400f7b49:0x3fff5a50 in I2SMEMSSampler::read() at lib/audio_input/src/I2SMEMSSampler.cpp:297
#3 0x400f7cef:0x3fff5aa0 in I2SMEMSSampler::start_read_thread() at lib/audio_input/src/I2SMEMSSampler.cpp:403
(inlined by) I2SMEMSSampler::start_read_thread_wrapper(void*) at lib/audio_input/src/I2SMEMSSampler.cpp:414
Which ends up at
inference->buf_ready = 1;
if (ei_TaskHandler != NULL)
xTaskNotify(ei_TaskHandler, (0), eNoAction);
}
ei_TaskHandler is created as:
TaskHandle_t ei_TaskHandler = nullptr; // Task handler from I2S to AI inference
& this is only used once the ei_thread is created:
int ret = xTaskCreatePinnedToCore(this->start_ei_thread_wrapper, "ei_thread", 1024 * 4, this, TASK_PRIO_AI, &ei_TaskHandler, TASK_AI_CORE);
Could the crash sequence be:
Does this seem possible?
@OOHehir Yes, I set the CONFIG_HEAP_POISONING_COMPREHENSIVE=y option, but I changed it in sdkconfig.defaults so it should be used for your build as well. If not, we have to rethink the sdkconfig. Do you maybe have to do a clean build to update this?
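For reference, assuming it sits in sdkconfig.defaults as described, the relevant line is:

```ini
# sdkconfig.defaults -- debug builds only, comprehensive poisoning adds significant overhead
CONFIG_HEAP_POISONING_COMPREHENSIVE=y
```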
@LIFsCode because I already had a valid sdkconfig for the task, it doesn't get regenerated from sdkconfig.defaults. I deleted the applicable sdkconfig & tested; can't reproduce so far. terminal_output_heap_config.txt
@OOHehir @EDsteve Update on this: I did all these tests with a new commit (93ff97b) for debugging. I created a new branch for it; @OOHehir, I realized too late that you pushed to PerfDebugging, and I don't want to mess up the addresses since they might be important for debugging. We can clean up the various branches once we've found all the errors.
I added a lot of heap traces and heap integrity checks; however, I couldn't get the monitor to decode the addresses of the callers. Only the backtrace when the crash occurred is decoded. The log files below contain these traces. Additionally, I added the option to output a .map file for debugging. Uploaded my version here: firmware.map.zip (don't get confused by the name). This can be viewed with AMAP.
- framework-arduinoespressif32 @ 3.20007.0 (2.0.7)
- framework-espidf @ 3.40404.0 (4.4.4)
- tool-cmake @ 3.16.4
- tool-esptoolpy @ 1.40500.0 (4.5.0)
- tool-idf @ 1.0.1
- tool-mconf @ 1.4060000.20190628 (406.0.0)
- tool-ninja @ 1.9.0
- toolchain-esp32ulp @ 1.23500.220830 (2.35.0)
- toolchain-xtensa-esp32 @ 8.4.0+2021r2-patch5
setRecordMode#mode=recordOn_detectOff
CORRUPT HEAP: Invalid data at 0x3ffe3260. Expected 0xfefefefe got 0xfefefe02
Note the corrupt byte is always 02.
This gives us the following 2 conditions which are required:
@LIFsCode Some nice investigation work. Based on it, I think we may be a bit further in finding the bug. Selectively commenting out this line triggers/prevents the crash.
@EDsteve @LIFsCode I have to put my hand up & say that this looks like it's my error, apologies. When I carried out this commit I didn't correct the enum scope. As it was hidden by #include guards it wasn't flagged in my IDE as an error, but it still compiled successfully, even when the Edge Impulse feature was enabled. I've temporarily added
#define EDGE_IMPULSE_ENABLED
in project_config.h to improve visibility.
The PerfDebugging branch is still crashing when this command is issued
setRecordMode#mode=recordOn_detectOff
This is quite a change on my side; in previous builds this didn't occur, even before the correction above. I'm not sure what change is causing this, but I'm getting a lot of new errors, perhaps due to the extra memory & overhead of the heap tracing.
Once again apologies for the considerable time this bug has caused.
@LIFsCode @OOHehir Not sure if these reboots are connected, but I think it's because of BT turning OFF or ON.
Situation 1
After a few hours of recordon_detecton I triggered the knock sensor and then it rebooted with a Guru Meditation Error. This has happened only once so far. Log files (using commit cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0): Reboot while recordon_detecton.txt, Bootlog after crash.txt
Situation 2
Start the ELOC and just wait. When BT turns off due to the BT timeout, it restarts. This doesn't happen every time, but it has happened a few times over the last few weeks.
Using commit: 29325f68d7c831e7b754bf81b368c9930524c5eb Reset when BT turns off.txt
Using commit: 8cc4b9949c1cb9bd53aed99f9485e725fab7dcb0 Reset when BT turns off 2.txt
Using commit: cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0 Reset when BT turns off 3.txt