spinlock_acquire assertion fail (IDFGH-10048)

espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.

Apache License 2.0

13.71k stars 7.3k forks source link

spinlock_acquire assertion fail (IDFGH-10048) #11325

Closed dragazo closed 1 year ago

dragazo commented 1 year ago

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I've been using ESP-IDF v4.4 (but I have updated to the latest release, v4.4-dev-3540-g4e03a9c34c - I can't use anything after v4.4 because websockets were removed) on my ESP32-S3 DevkitC-1 N32R8V and have been having no problems. But suddenly, with no changes other than rebuilding ESP-IDF, I now get frequent crashes from spinlock_acquire. I get the following error message:

assert failed: spinlock_acquire spinlock.h:122 (result == core_id || result == SPINLOCK_FREE)

In some cases I get a different error from compare_set.h, which appears to come from a function called just prior to the assertion in spinlock_acquire. There's no error message for this other than a panic message and a stack trace including compare_set at the top:

Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.
...
0x403805cd - compare_and_set_native
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_hw_support/include/soc/compare_set.h:25
0x403834e9 - vPortEnterCritical
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/freertos/port/xtensa/include/freertos/portmacro.h:578
0x40383516 - multi_heap_malloc_impl
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/multi_heap.c:186
0x403769bb - heap_caps_malloc_base
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:175
0x40376a0c - heap_caps_malloc_default
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:230
0x403842a5 - malloc
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/newlib/heap.c:24
...

It seems that it's calling compare_and_set_native, which may be incorrect because I have SPI RAM support enabled and accessibe through malloc. Altohugh it looks like this is specifically an error caused by dynamic allocation, rather than the mutex lock. However, it seems the common thread between these issues is the compare_and_set function. Possibly related to #9120 since I am also using SPI RAM on an ESP32-S3.

~~The specific interaction that causes this is when I perform an HTTPS request to the device with the https server component, which then writes to NVS. It may be some complex interaction between these.~~ The problem occurs in general with some, but not all, dynamic allocations with malloc (more details below). ~~Also, because it only started doing this recently, I think it is likely some (relatively) recent change to the ESP-IDF v4.4 release.~~ I've reverted my version of ESP-IDF back to the version when my project was last working (from a clean build), and the problem persists (more details below).

AxelLin commented 1 year ago

957505136bb31c7fa17f7f47ff4869205e191fad is already fixed in v4.4 branch, you need to use v4.4.2+ to get the fix.

dragazo commented 1 year ago

Well I'm using the head of the release/v4.4 branch, and it looks like that commit is indeed included in the version I'm using. And in my sdkconfig.defaults file I have CONFIG_ESP32S3_SPIRAM_SUPPORT=y, so it should be using that code, and yet isn't for some reason. Either this is a config issue, or this particular mutex really is in internal memory (hence the compare_and_set_native call). From the call stack, this is during dynamic allocation - I haven't looked into that part in detail, but I imagine that's a mutex lock for the global allocator used by malloc, which would make sense to be in internal memory.

ESP-Marius commented 1 year ago

I can't use anything after v4.4 because websockets were removed)

You can still use websockets, the functionality just got moved to a separate managed component instead of being in the esp-idf repo: https://components.espressif.com/components/espressif/esp_websocket_client

Well I'm using the head of the release/v4.4 branch, and it looks like that com mit is indeed included in the version I'm using.

As far as I can see 4e03a9c3 do not include that fix. Try fetching the latest release/v4.4 and see if the error goes away

dragazo commented 1 year ago

You can still use websockets, the functionality just got moved to a separate managed component

Ah, well I'll have to look into that for the future, but for now I'm just struggling to get my project back to working order lol

Try fetching the latest release/v4.4 and see if the error goes away

The 4e03a9c3 part is just what's shown in the boot serial out (I (49) boot: ESP-IDF v4.4-dev-3540-g4e03a9c34c 2nd stage bootloader idk if that's actually a commit hash or not). But the commit I actually have checked out is 3cec3a002609 which should have the change. When I go to the spinlock.h file, it at least has those changes included in the source. And I've deleted compile artifacts and rebuilt, so it should be using the up to date version.

dragazo commented 1 year ago

One thing I was wondering: are the values in sdkconfig.defaults directly set as their preprocessor macro values? If so, CONFIG_ESP32S3_SPIRAM_SUPPORT=y would have CONFIG_ESP32S3_SPIRAM_SUPPORT be falsey in the #if expression since defined is no longer used, which would result in the external RAM support not being enabled in spinlock.h.

Another update is that the assertion fail does not always happen just with the https interaction - it now also happens on a normal string copy (but still during a call to malloc) during my program initialization. Unfortunately, nothing has changed since then aside from adding some debugging output, which makes me think it's an issue with where things get put in the binary or memory.

dragazo commented 1 year ago

I've also tracked it down to the multi-heap malloc lock: https://github.com/espressif/esp-idf/blob/3cec3a0026098d1b027f2103ec154a15baf97318/components/heap/multi_heap.c#L186

I'll try to get a minimal reproducible example, but so far when I remove a bunch of other stuff it seems to not happen (I have several background threads that can do allocations, etc.). But given that the crash is in the malloc impl, I don't think this is caused by my code itself doing something wrong - maybe just parallel malloc contention on that mutex.

ESP-Marius commented 1 year ago

The 4e03a9c3 part is just what's shown in the boot serial out (I (49) boot: ESP-IDF v4.4-dev-3540-g4e03a9c34c 2nd stage bootloader idk if that's actually a commit hash or not). But the commit I actually have checked out is 3cec3a002609 which should have the change. When I go to the spinlock.h file, it at least has those changes included in the source. And I've deleted compile artifacts and rebuilt, so it should be using the up to date version.

It should be the commit ID of the esp-idf version you are booting your app with. So i'd suggest you first triple check that you are indeed compiling/flashing the expected esp-idf project (e.g. add an assert in the spinlock.h that you see contain the fix and make sure that it is actually this file that is being used.

One thing I was wondering: are the values in sdkconfig.defaults directly set as their preprocessor macro values? If so, CONFIG_ESP32S3_SPIRAM_SUPPORT=y would have CONFIG_ESP32S3_SPIRAM_SUPPORT be falsey in the #if expression since defined is no longer used, which would result in the external RAM support not being enabled in spinlock.h.

Setting it in sdkconfig.defaults would result in a #define CONFIG_ESP32S3_SPIRAM_SUPPORT 1 in sdkconfig.h so it should work fine.

I'll try to get a minimal reproducible example, but so far when I remove a bunch of other stuff it seems to not happen (I have several background threads that can do allocations, etc.).

A minimal example would be great!

But given that the crash is in the malloc impl, I don't think this is caused by my code itself doing something wrong - maybe just parallel malloc contention on that mutex.

From my experience these crashes from the heap are almost always due to memory corruption in the user app, if there was issues with e.g. the malloc lock then we would likely see a lot of reported issues on it. Never know though so I'll of course keep following up in this issue.

dragazo commented 1 year ago

It should be the commit ID of the esp-idf version you are booting your app with. So i'd suggest you first triple check that you are indeed compiling/flashing the expected esp-idf project (e.g. add an assert in the spinlock.h that you see contain the fix and make sure that it is actually this file that is being used.

I'm not sure where that value is coming from, but I added an assert(false) to spinlock_acquire, did a fresh compile, and it does indeed halt at the new assert. And when I run git log in my esp-idf directory, it shows commit 3cec3a, so it should really be the up to date version despite the id weirdness. And the only changes I've made are adding some debugging stuff like this assert.

From my experience these crashes from the heap are almost always due to memory corruption in the user app, if there was issues with e.g. the malloc lock then we would likely see a lot of reported issues on it. Never know though so I'll of course keep following up in this issue.

The application itself is written in Rust, and I'm just using shallow FFI bindings to ESP-IDF. So basically it maps the global allocator directly to malloc, and so on for all the other esp-defined functions. So I believe my issue is still with ESP-IDF under the hood, and Rust should not even allow the type of memory corruption that you're referring to. Worst case it would panic out at runtime at the Rust level, but I don't even see that happening - stack traces always show it dipping down into ESP-IDF and then breaking there. And more confusingly, my whole project was working fine for months up until a few days ago when I did a fresh recompile (of ESP-IDF), so I know the Rust bindings are sound (and looking at them, they're literally just making extern C calls).

ESP-Marius commented 1 year ago

but I added an assert(false) to spinlock_acquire, did a fresh compile, and it does indeed halt at the new assert.

:+1:

And more confusingly, my whole project was working fine for months up until a few days ago when I did a fresh recompile (of ESP-IDF), so I know the Rust bindings are sound (and looking at them, they're literally just making extern C calls)

Lets see if we can get to the bottom of this. How about this:

before assert(result == core_id || result == SPINLOCK_FREE); add e.g.:

if(!(result == core_id || result == SPINLOCK_FREE)) {
    esp_rom_printf("lock = %p, result = %d, core_id = %d, lock_owner = %X \n", lock, result, core_id, lock->owner);
}

This should at least tell us abit about how the assert fails, if it is a lock in PSRAM etc.

I'd also suggest enabling heap poisoning if you are not already using it, and see if that can help you track down the root cause.

More about how heap memory debugging can be seen here, including information about heap poisoning.

And of course a minimal example capable of reproducing this would be the best, but I know that these kind of errors tends to disappear as soon as you try to narrow your application down a bit...

dragazo commented 1 year ago

I've removed all other threads from my project (at least those I created myself), and removed a few other components of my code. The error now sometimes doesn't happen (around 10% success rate), and when it does fail, it fails in different ways, but always in relation to dynamic allocation. Here are some of the new errors it gives:

assert failed: insert_free_block heap_tlsf.c:233 (current && "free list cannot have a null entry")
Backtrace: 0x4037626a:0x3fca3360 0x4037d499:0x3fca3380 0x40384031:0x3fca33a0 0x40382355:0x3fca34c0 0x40383264:0x3fca34e0 0x40376b59:0x3fca3500 0x40384061:0x3fca3520 0x42087416:0x3fca3540 0x42022458:0x3fca3560 0x42022d38:0x3fca3580 0x42022d4d:0x3fca35a0 0x42059e13:0x3fca35c0 0x4205a13a:0x3fca35e0 0x4205a15e:0x3fca3600 0x4205a143:0x3fca3620 0x4205a15e:0x3fca3640 0x4205a143:0x3fca3660 0x4201ca16:0x3fca3680 0x420201d1:0x3fca3cd0 0x42021e0f:0x3fca4090 0x421042c3:0x3fca4420 0x42020662:0x3fca4440 0x42030333:0x3fca44c0 0x4210861b:0x3fca44e0
0x4037626a - panic_abort
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_system/panic.c:408
0x4037d499 - esp_system_abort
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_system/esp_system.c:137
0x40384031 - __assert_func
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/newlib/assert.c:85
0x40382355 - block_merge_prev
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_tlsf.c:343
0x40383264 - multi_heap_free_impl
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/multi_heap.c:212
0x40376b59 - heap_caps_free
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:382

Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.
Core  0 register dump:
PC      : 0x403825ce  PS      : 0x00060133  A0      : 0x80383267  A1      : 0x3fca3560  
0x403825ce - insert_free_block
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_tlsf.c:237
A2      : 0x3fc9eebc  A3      : 0xb33fffff  A4      : 0x00000018  A5      : 0x3d81910c  
A6      : 0x00000002  A7      : 0x42022d40  A8      : 0x3fce9678  A9      : 0x00000001  
0x42022d40 - alloc::raw_vec::RawVec<T,A>::current_memory
    at /home/devin/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:241
A10     : 0x3fc9efa0  A11     : 0x00000003  A12     : 0x3fce9680  A13     : 0x3fc9ef14  
A14     : 0x00000002  A15     : 0x00000005  SAR     : 0x0000001b  EXCCAUSE: 0x0000001d  
EXCVADDR: 0xb340000b  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0x00000000  
0x40056f5c - rom_rx_gain_force
    at ??:??
0x40056f72 - rom_rx_gain_force
    at ??:??
Backtrace: 0x403825cb:0x3fca3560 0x40383264:0x3fca3580 0x40376b59:0x3fca35a0 0x40384061:0x3fca35c0 0x42087416:0x3fca35e0 0x42022458:0x3fca3600 0x4200ebeb:0x3fca3620 0x42059e2b:0x3fca3640 0x4205a13a:0x3fca3660 0x4201ca16:0x3fca3680 0x420201d1:0x3fca3cd0 0x42021e0f:0x3fca4090 0x421042c3:0x3fca4420 0x42020662:0x3fca4440 0x42030333:0x3fca44c0 0x4210861b:0x3fca44e0
0x403825cb - insert_free_block
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_tlsf.c:235
0x40383264 - multi_heap_free_impl
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/multi_heap.c:212
0x40376b59 - heap_caps_free
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:382
0x40384061 - free
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/newlib/heap.c:39

But I reverted those minimization attempts to test out the original problem with the logging you suggested (thanks for that btw, I was wanting to do that myself but didn't know how to do it in esp's source lol). Here's what comes up when doing that:

lock = 0x3c187fa0, result = 1667592812, core_id = 52685, lock_owner = 6365726C 
assert failed: spinlock_acquire spinlock.h:129 (result == core_id || result == SPINLOCK_FREE)
Backtrace: 0x4037628a:0x3fc9efc0 0x4037d4b9:0x3fc9efe0 0x403842c1:0x3fc9f000 0x40380693:0x3fc9f120 0x40383515:0x3fc9f160 0x40383542:0x3fc9f180 0x403769bf:0x3fc9f1a0 0x40376a10:0x3fc9f1d0 0x403842d1:0x3fc9f1f0 0x420d98f4:0x3fc9f210 0x420dc214:0x3fc9f230 0x4205b932:0x3fc9f250 0x4205c130:0x3fc9f270 0x4205c10c:0x3fc9f290 0x4205c1c8:0x3fc9f2b0 0x420bb25a:0x3fc9f2d0 0x4205c218:0x3fc9f2f0 0x4205c618:0x3fc9f310 0x42070188:0x3fc9f330 0x42089c26:0x3fc9ff40 0x42071de1:0x3fca00f0 0x42091457:0x3fca0d00 0x4208990a:0x3fca0e20 0x4207d1c2:0x3fca0f70 0x4207849b:0x3fca1500 0x42079fda:0x3fca18a0 0x4207849b:0x3fca1e30 0x42079fda:0x3fca21d0 0x4207849b:0x3fca2760 0x420975bb:0x3fca2b00 0x4209eecb:0x3fca31b0 0x4205144e:0x3fca3810 0x42055b8d:0x3fca3e60 0x4205b2ff:0x3fca44c0 0x42161963:0x3fca4850 0x42059bc1:0x3fca4870 0x4206b10f:0x3fca48f0 0x421661df:0x3fca4910
0x4037628a - panic_abort
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_system/panic.c:408
0x4037d4b9 - esp_system_abort
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_system/esp_system.c:137
0x403842c1 - __assert_func
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/newlib/assert.c:85
0x40380693 - spinlock_acquire
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/esp_hw_support/include/soc/spinlock.h:129
0x40383515 - vPortEnterCritical
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/freertos/port/xtensa/include/freertos/portmacro.h:578
0x40383542 - multi_heap_malloc_impl
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/multi_heap.c:186
0x403769bf - heap_caps_malloc_base
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:175
0x40376a10 - heap_caps_malloc_default
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/heap/heap_caps.c:230
0x403842d1 - malloc
    at /home/devin/Desktop/netsblox-vm-esp32/.embuild/espressif/esp-idf/release-v4.4/components/newlib/heap.c:24

So based on the lock address and this comment it sounds like the lock is indeed allocated in PSRAM.

If CONFIG_HEAP_POISONING_COMPREHENSIVE=y (and a fresh compile) is enough to enable the heap corruption tests, it doesn't find anything prior to the assert crashing everything.

dragazo commented 1 year ago

I also just tried adding a bit more debugging to make sure spiram was supported and if esp-idf thinks the lock is in spiram. The new logging looks like this:

if(!(result == core_id || result == SPINLOCK_FREE)) {
    #if (CONFIG_ESP32_SPIRAM_SUPPORT || CONFIG_ESP32S3_SPIRAM_SUPPORT)
    int spiram_supported = 1;
    int lock_in_spiram = esp_ptr_external_ram(lock);
    #else
    int spiram_supported = 0;
    int lock_in_spiram = 0;
    #endif
    esp_rom_printf("spiram_supported = %d, lock_in_spiram = %d \n", spiram_supported, lock_in_spiram);
    esp_rom_printf("lock = %p, result = %d, core_id = %d, lock_owner = %X \n", lock, result, core_id, lock->owner);
}

And this is the output that I get:

spiram_supported = 1, lock_in_spiram = 0 
lock = 0x3c187fa0, result = 1667592812, core_id = 52685, lock_owner = 6365726C

So as far as I can tell, I have SPIRAM support enabled correctly, and the lock seems to be in SPIRAM based on the address (but correct me if I'm wrong), and yet ESP-IDF seems to think it's not allocated in SPIRAM as determined by esp_ptr_external_ram.

dragazo commented 1 year ago

I did a bit more digging and found out the 0x3c... range might not actually be the start of the SPIRAM page, contrary to the comment I was basing that on (here).

if(!(result == core_id || result == SPINLOCK_FREE)) {
    #if (CONFIG_ESP32_SPIRAM_SUPPORT || CONFIG_ESP32S3_SPIRAM_SUPPORT)
    esp_rom_printf("spiram_supported = 1, lock_in_spiram = %d \n", esp_ptr_external_ram(lock));
    #else
    esp_rom_printf("spiram_supported = 0 \n");
    #endif

    #if SOC_SPIRAM_SUPPORTED
    esp_rom_printf("soc_spiram_supported = 1, low = %p, high = %p \n", SOC_EXTRAM_DATA_LOW, SOC_EXTRAM_DATA_HIGH);
    #else
    esp_rom_printf("soc_spiram_supported = 0 \n");
    #endif

    esp_rom_printf("lock = %p, result = %d, core_id = %d, lock_owner = %X \n", lock, result, core_id, lock->owner);
}

spiram_supported = 1, lock_in_spiram = 0 
soc_spiram_supported = 1, low = 0x3d000000, high = 0x3e000000 
lock = 0x3c185370, result = 1667592812, core_id = 52685, lock_owner = 6365726C 
assert failed: spinlock_acquire spinlock.h:141 (result == core_id || result == SPINLOCK_FREE)

It looks like the SPIRAM range is now [0x3d000000, 0x3e000000) rather than the [0x3C000000, 0x3E000000) that was previously mentioned... Could this be the issue?

ESP-Marius commented 1 year ago

@dragazo good find! This doesnt seem right from what I can see in the TRM. PSRAM would normally be mapped from X to 0x3E000000, so this would explain why this issue hasnt seen many reports.

IDF master correctly sets #define SOC_EXTRAM_DATA_LOW 0x3C000000. Can just try to change it and see if that fixes the issue?

What is the size of your PSRAM?

dragazo commented 1 year ago

It looks like on ESP32-S3 (like I'm using), the starting address is set at 0x3D000000 as defined here. I just tried changing that to 0x3C000000, but it doesn't look like that solved the problem, although now spinlock does think the lock is allocated in PSRAM.

spiram_supported = 1, lock_in_spiram = 1 
soc_spiram_supported = 1, low = 0x3c000000, high = 0x3e000000 
lock = 0x3c187fa0, result = 1667592812, core_id = 52685, lock_owner = 6365726C

Looking more in that file, it looks like ESP32-S3 uses the [0x3C000000, 0x3D000000) range for DROM as defined here. But that doesn't seem right, cause it's doing a writing operation to it... I guess that would somewhat explain why I sometimes get a store prohibited exception (one of the errors posted above), though that doen't happen every time.

My PSRAM is 8MB, and it does get loaded in correctly (as determined by init logging and the heap free functions). Although it doesn't look like a heap init message is printed for that one, but that might not actually be an issue:

I (2281) heap_init: Initializing. RAM available for dynamic allocation:
I (2288) heap_init: At 3FC9F2F0 len 0004A420 (297 KiB): D/IRAM
I (2295) heap_init: At 3FCE9710 len 00005724 (21 KiB): STACK/DIRAM
I (2302) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
I (2308) heap_init: At 600FE000 len 00002000 (8 KiB): RTCRAM
I (2314) spiram: Adding pool of 8192K of external SPI memory to heap allocator

ESP-Marius commented 1 year ago

Ah yeah, external memory on S3 can be mapped to 0x3C000000~0x3E000000, but like you say, the 0x3C000000, 0x3D000000) area should only be used for DROM. So why are we ending up with a lock variable that points to that area? :thinking:

Another thing is that 6365726C doesnt look like a valid lock owner ID, but rather just random data.

Seeing as you are also getting other heap related errors (https://github.com/espressif/esp-idf/issues/11325#issuecomment-1535661552) I am still leaning towards memory being corrupted. I think it is unlikely to be related to compare_and_set_native vs extram bug that we fixed earlier.

ESP-Marius commented 1 year ago

Looking at your profile it seems to be related to https://github.com/dragazo/netsblox-vm-esp32 am I right? If you have a branch for me that can show this error I would be happy to take a look as well.

dragazo commented 1 year ago

@ESP-Marius Oh yes, that's right the right repo, thank you so much for offering! I just put together a branch called bug-fixing that has everything you'll need, but I'll warn you there's a little bit of setup...

Once you clone the repo, you can just do cargo run and it'll automatically pull the latest ESP-IDF v4.4, compile it and all the rust stuff, link, etc. and also flash the device and show a serial monitor (connected over USB). You'll need espflash installed for that part (cargo install espflash), as well as the esp rust toolchain (see the repo readme). Once it's running, it'll not crash the first time cause it's not configured. It'll print in serial out a server IP address. It'll host an access point where you can connect to its wifi (ssid nb-esp32 by default) and go to that IP in a browser to configure it. There you'll need to add wifi credentials (I tried removing all the parts that required networking, but that's practically the entire app unfortunately, and it causes the problem to disappear) and then reset the board. To keep the serial monitor I usually just ctrl+c and rerun cargo run. It should now connect to wifi (if it doesn't it'll just sit there doing nothing and say no client ip address), and if so should produce one of the errors I'm getting.

The ESP-IDF files can be found in the .embuild directory of the root project. Any changes you do to the source will require a fresh compile (cargo clean && cargo run). You might be able to get away with only deleting the esp-idf-sys build artifacts, but I usually play it safe and clean everything.

I'll also note my device is an ESP32-S3 N32R8. If you have less than 16MB of flash on your device, you'll need to change line 23 of .cargo/config.toml accordingly. Some other options are ESP32-S3 specific, so likely won't work on other chips without some deeper config changes.

ESP-Marius commented 1 year ago

Turns out me messing around with rust aint very effective as I basically know nothing about it heh.

Some more analysis that might help: if the lock is really placed into rodata and DROM, then we should be able to look up the address 0x3c187fa0 somehow and see the symbol. (In C I would look in the map/elf file, unsure how this works with rust..)

When booting the binary I got

I (118) esp_image: segment 0: paddr=00010020 vaddr=3c170020 size=6a17ch (434556) map. Which indicate that 0x3c187fa0 is indeed one of the addresses mapped to DROM.

idea--list commented 1 year ago

Find this ticket interesting as i am also facing spinlock_acquire errors for quite a while on my ESP32S3. In my case it occurs every 6-10 hours during data logging to an SD card over SPI bus. I have 2 sensors logging data in 2 different files. And 2 tasks accessing the SD card: one task creates the files, adds data to them from a ringbuffer using fwrite(), closes the files. Another task reads previous files and transfers them to a remote server over FTP. I understand that SPI master driver is not thread safe and thus i need to prevent the above 2 tasks to access the SD card in the same time. Well i did my best and tried to wrap around all the parts trying to access the SD card with eventgroup bits and also with a mutex as suggested in the documentation of the SPI master APIs. While both methods seem to reduce the frequency of spinlock_acquire error, neither of these methods solve it completely. I only face it on long logging sessions, but still this renders my data logging app unreliable. Will follow this ticket, maybe it helps me overcoming my issue.

dragazo commented 1 year ago

@ESP-Marius Ah yeah, I thought that might happen, especially since I just dumped a big repo on you lol Well tbh I'm not sure how to do symbol lookup like that in rust either, since I've never had to do it. But I can check into that when I get some free time again. In the meantime, I found increasing the main process stack size seems to avoid this issue, but I need more testing to be sure it actually is solving the problem and not just conveniently bypassing it in this specific instance (like a location-specific fix) or delaying it somehow. If that is the solution, it seems that there is memory corruption, but specifically from a stack overflow (which is apparently not checked at runtime?) which wasn't even on my radar cause of the whole spinlock assert thing. Is there a config option you know of to enable stack overflow protection so I can see if that's the issue? Other than that, I'll also try changing the PSRAM settings to make everything be allocated in PSRAM and see if the issue still persists. I can test that sometime in the next few days, and if all goes well I'll close the issue.

@idea--list you might want to try the increased main process stack size thing and see if that helps. Given your description, it doesn't sound like it would fix it, but might as well try.

ESP-Marius commented 1 year ago

If that is the solution, it seems that there is memory corruption, but specifically from a stack overflow (which is apparently not checked at runtime?) which wasn't even on my radar cause of the whole spinlock assert thing. Is there a config option you know of to enable stack overflow protection so I can see if that's the issue?

Yeah, from all the info I still think memory corruption is a likely candidate. IDF has a couple of helpers to detect stackoverflow:

dragazo commented 1 year ago

Alright, I just added CONFIG_COMPILER_STACK_CHECK_MODE_ALL=y to my sdkconfig.defaults file, recompiled, and it does indeed catch a stack overflow. Increasing the stack size a bit makes that error go away, but still has a store prohibited exception. But increasing it even higher seems to fix that problem too. I also tried setting CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=0 to make all allocations be in PSRAM, and still no problems. So this issue is almost certainly just memory corruption resulting from stack overflow. I'll go ahead and close this issue and reopen if the old spinlock issue comes back but stack size doesn't fix it. @ESP-Marius thank you so much for your help!

ZanzyTHEbar commented 1 year ago

Sorry to open this - i have a project that works on all other ESP boards we've tried, but this exact issue is occuring on the esp32s3 XIAO Sense board that just came out. Any suggestions - note, we are using the Arduino Core and not ESP IDF.

When ever we use any wifi functions on the board, the chip core panics out with this error.

ESP-Marius commented 1 year ago

@ZanzyTHEbar Can you provide some more information to help debug it?

Full log from the crash
Which Arduino version you are using.
A producible example would also be great if possible (but if it only fails ony that chip it might not help much as I dont have one)

ZanzyTHEbar commented 1 year ago

No problem, thanks for replying so soon :)

Here is the log and backtrace:

assert failed: spinlock_acquire spinlock.h:122 (result == core_id || result == SPINLOCK_FREE)

Backtrace: 0x40377386:0x3fceb720 0x4037d47d:0x3fceb740 0x403840d1:0x3fceb760 0x40380995:0x3fceb890 0x40383a15:0x3fceb8d0 0x40383d15:0x3fceb8f0 0x403777cd:0x3fceb910 0x40377915:0x3fceb940 0x4037755e:0x3fceb9a0 0x40377571:0x3fceb9c0 0x42043336:0x3fceb9e0 0x4202a2a2:0x3fceba00 0x42096b7b:0x3fceba20 0x42097377:0x3fcebbb0 0x42097490:0x3fcebbd0 0x42098700:0x3fcebbf0 0x4200a366:0x3fcebd60 0x4200aaf5:0x3fcebdd0 0x42003c55:0x3fcebe10 0x4200410f:0x3fcebe60 0x42010b9f:0x3fcebeb0

I am using platformio version 6.1.7 with arduino espressif core 6.1.0 which has Arduino Core v2.0.7

The example is my project located here - we have functioning support already for most Camera-based ESP32 boards.

If it is more relevant i can move my issue to the platformio-espressif32 repo, i only started my inquiry here as this issue is the same error we are facing. Though, only in relation to the esp32s3 xiao boards.

Note: i just saw here that the platform-espressif32 for pio has an update that adds support for the SeeedStudio Xiao ESP32S3 and some other esp32s3 based boards - so i will upgrade and try again later today and report back if that resolves the issue.

dragazo commented 1 year ago

@ZanzyTHEbar did you already try the stack overflow check mentioned above? Even if that doesn't show anything, you might want to try bumping up the stack size just to be sure. Also, you can add the logging above that prints the mutex address and other info that might be needed.

ZanzyTHEbar commented 1 year ago

I haven't tried to increase the stack size - yet, will try that now. i tried to add the logging mentioned above, it returned no values.

i also updated to the newest espressif32 platformio for pio and it did not solve the issue.

ZanzyTHEbar commented 1 year ago

Increasing the stack size isn't a solution anyways as it can only be changed from the sdkconfig.h file from the toolchain.

When i increased the stack-size the spin-lock error went awayandi received a core -panic dump:

Guru Meditation Error: Core  1 panic'ed (StoreProhibited). Exception was unhandled.
10:59:34.053 >
10:59:34.053 > Core  1 register dump:
10:59:34.053 > PC      : 0x403808f8  PS      : 0x00060733  A0      : 0x80383a18  A1      : 0x3fced890
10:59:34.053 > A2      : 0xd6f27157  A3      : 0xb33fffff  A4      : 0x0000cdcd  A5      : 0x00060723
10:59:34.053 > A6      : 0x00060720  A7      : 0x0000abab  A8      : 0x0000abab  A9      : 0xffffffff
10:59:34.053 > A10     : 0x00000000  A11     : 0x40377568  A12     : 0x3fceda28  A13     : 0x3fc983d4
10:59:34.053 > A14     : 0x99f27157  A15     : 0x00ffffff  SAR     : 0x00000008  EXCCAUSE: 0x0000001d
10:59:34.053 > EXCVADDR: 0xd6f27157  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff

The mentioned:

esp_rom_printf("lock = %p, result = %d, core_id = %d, lock_owner = %X \n", lock, result, core_id, lock->owner);

Did nothing as well - no extra logging.

dragazo commented 1 year ago

How much did you increase stack size? From my problem above, when I increased it some I also got this store prohibited exception, but after doubling it again it went away. If you can add config options, CONFIG_COMPILER_STACK_CHECK_MODE_ALL=y was very useful. Aside from that, do you have any context on when this happens in your actual code?

ZanzyTHEbar commented 1 year ago

How much did you increase stack size? From my problem above, when I increased it some I also got this store prohibited exception, but after doubling it again it went away. If you can add config options, CONFIG_COMPILER_STACK_CHECK_MODE_ALL=y was very useful. Aside from that, do you have any context on when this happens in your actual code?

Yes - whenever i use any wifi related methods. The arduino-espressif core provides the WiFi.h header, whenever i use any methods from this class, the stackoverflow and memory corruption occur. Only on the esp32s3 xiao sense boards though - exact same code works flawlessly on other esp32 models.

I increased the stack size to 16384 - which is double the default. Increasing it to 33668 is just ridiculous - that is a massive stack and no task should need that stack size. Especially when the exact same code works on other esp32s flawlessly without increasing the stack size from the default, including other esp32s3 modules (i tested on the wrover variant).

To add to that, increasing the stacksize is just a bandaid in my opinion. Not only does it force us to modify the sdkconfig.h header of the toolchain, which is not a portable solution, but it forces us to increase the stack size to a ludicrous amount. 32kb of stack is just massive - for no reason.

Since i am not using the ESP-IDF i am not exactly sure where to find the CONFIG_COMPILER_STACK_CHECK_MODE_ALL setting - i have looked and can't find it in the Arduino Core toolchain. Happy to give it a try to get better logging though - as all of the other logging options proposed here have not yielded any results. The arduino core seems to be overriding a lot of these settings.

Thank you all for the help so far. I am beginning to think this is just a problem with the esp32s3 support package.

Note: I will test increasing the stack size to 33668 - it's worth the test to see if this "solves" the issue. However, if it does, I am not satisfied that this is a solution and am curious where to go from here to try and get this properly resolved.

ZanzyTHEbar commented 1 year ago

I just tried to increase the stack size to 33668 and it did nothing - in fact the spinlock error is now back. I erased the flash and tried 16384 again - and spinlock error again. For some reason, I am not getting the core-panic anymore at all, on any stack size. I have not modified the code itself, just the stack size. Clearly increasing the stack size is not a viable solution, even if it works occasionally.

ESP-Marius commented 1 year ago

Yes - whenever i use any wifi related methods. The arduino-espressif core provides the WiFi.h header, whenever i use any methods from this class, the stackoverflow and memory corruption occur. Only on the esp32s3 xiao sense boards though - exact same code works flawlessly on other esp32 models.

Can you provide a reference to exactly which config file these "other models" are so we can see how the configs differ?

I increased the stack size to 16384 - which is double the default. Increasing it to 33668 is just ridiculous

Yes, if it was a stackoverflow from that task then the initial increase would already have fixed it.

To add to that, increasing the stacksize is just a bandaid in my opinion.

If the error is caused by a stackover flow then it is an actual fix. If it just happen to work because stuff gets moved around in memory then yes it is not a proper fix.

The arduino core seems to be overriding a lot of these settings.

As far as I understand the ardunio core is using a pre-compiled version of IDF, so the settings/adding logging wont actually do anything.

Most likely candidate is still a memory corruption as the error seems to jump around depending on memory layout (and EXCVADDR: 0xd6f27157 is trying to write to a random memory address that doesnt exists). I suggest trying to isolate different parts of your program and try to figure out exactly what causes the issue. If you are able to produce a simple example that can reproduce the error then I'm happy to take a look as wel..

ZanzyTHEbar commented 1 year ago

Yes - whenever i use any wifi related methods. The arduino-espressif core provides the WiFi.h header, whenever i use any methods from this class, the stackoverflow and memory corruption occur. Only on the esp32s3 xiao sense boards though - exact same code works flawlessly on other esp32 models.

Can you provide a reference to exactly which config file these "other models" are so we can see how the configs differ?

I increased the stack size to 16384 - which is double the default. Increasing it to 33668 is just ridiculous

Yes, if it was a stackoverflow from that task then the initial increase would already have fixed it.

To add to that, increasing the stacksize is just a bandaid in my opinion.

If the error is caused by a stackover flow then it is an actual fix. If it just happen to work because stuff gets moved around in memory then yes it is not a proper fix.

The arduino core seems to be overriding a lot of these settings.

As far as I understand the ardunio core is using a pre-compiled version of IDF, so the settings/adding logging wont actually do anything.

Most likely candidate is still a memory corruption as the error seems to jump around depending on memory layout (and EXCVADDR: 0xd6f27157 is trying to write to a random memory address that doesnt exists). I suggest trying to isolate different parts of your program and try to figure out exactly what causes the issue. If you are able to produce a simple example that can reproduce the error then I'm happy to take a look as wel..

Thank you for your response,

Any, pick an ESP - we have tried it. WRover versions, WRoom versions. We have tested all ESP32 variants that havea camera except for the ESP32S2 as it's not relevant to our project. Have even tested the ESP32S3 Wroom by Freenove, that works as well. All supported boards andtheir configs are here. Global configuration details are here

i have already isolated the section of the program that causes the issue, as i stated in my previous posts, wifi. If i comment out all of the wifi methods, program works fine except for wifi (of course, it's disabled). If i uncomment any wifi method from the WiFi.h header file (so not my code - code from Arduino/espressif) - the issue occurs. Note - wifi works fine on the other esp boards supported in the previously mentioned ini file.

As an example - on the esp32s3 xiao sense boards:

WiFi.begin(ssid, pass);

That fails, and causes the spin-lock error. Simply calling the begin method of the wifi stack.

Apologies or my delayed response, i've been super busy at work. Thank you for your continued assistance 🙂.

dragazo commented 1 year ago

Well given the weird address being reported, I'm still guessing this is a memory corruption issue, even if it's not stack-based like mine was. What if you make a new project that literally just calls WiFi.begin("literal ssid", "literal pass"); with no other code. If that succeeds, it's probably just your other initialization logic that causes some memory corruption somewhere that breaks the wifi module.

ZanzyTHEbar commented 1 year ago

I'd like to close this saying that i fixed it. The issue was two fold - one i didn't have the newest version of the seed_xiao_esp32s3 board package. And two, i was using pin 33 for the led, as that's the led pin for the espcam modules, and on the esp32s3 Xiao board pin 33 is SPICLK - which was corrupting the signal on the SPI line, going to the PSRAM chip.

After fixing both of these issues, the board now works. I apologize for the inconvenience, Hopefully my troubles help someone else in the future.