espressif / esp-matter

Espressif's SDK for Matter
Apache License 2.0
693 stars 156 forks source link

CHIP SDK Hangs randomly and device reboots due to watch dog timer timeout (CON-724) #598

Closed aggaddam closed 11 months ago

aggaddam commented 1 year ago

Describe the bug CHIP SDK Hangs randomly and the device reboots due to watchdog timer timeout. The watchdog timeout period is set to 30 seconds (Which means the idle task didn't get a chance to execute itself within 30 seconds). Suspecting CHIP SDK is blocking CPU time and not releasing it back.

Environment :

Any additional details ...

aggaddam commented 1 year ago

I got few more logs (Device side and Chip side logs) for the device crash during network pairing. Attached logs here for both device side and chiptool side

Chip Tool Log: chip_log1.txt

Esp32C3 Device Log: esp32c3_device_crash1.txt

PSONALl commented 1 year ago

Can you use latest esp-matter commit 68e99de70266674ce172459136960bbb5c58dc0e ? I dont see the issue there

PSONALl commented 1 year ago

Esp32C3 Device Log: esp32c3_device_crash1.txt

@aggaddam From these logs it looks like the crash is due to Matter resources being accessed from a non-Matter thread For more details, you can check: https://docs.espressif.com/projects/esp-matter/en/latest/esp32/faq.html#a1-11-chip-stack-locking-error-code-is-unsafe-racy

It is possible that your code is calling a matter API from a non-matter thread, Can you send a backtrace of your code before the crash happens

aggaddam commented 1 year ago

Can you use latest esp-matter commit 68e99de ? I dont see the issue there

I am seeing a few warnings/errors when I run the matter example code for the Room Air Conditioner device type with the latest ESP-Matter SDK (https://github.com/espressif/esp-matter/commit/68e99de70266674ce172459136960bbb5c58dc0e) that you pointed out. Do you know if this is expected behavior?

Here are the errors/warnings

Screenshot from 2023-08-30 15-20-18

aggaddam commented 1 year ago

Esp32C3 Device Log: esp32c3_device_crash1.txt

@aggaddam From these logs it looks like the crash is due to Matter resources being accessed from a non-Matter thread For more details, you can check: https://docs.espressif.com/projects/esp-matter/en/latest/esp32/faq.html#a1-11-chip-stack-locking-error-code-is-unsafe-racy

It is possible that your code is calling a matter API from a non-matter thread, Can you send a backtrace of your code before the crash happens

I am calling matter API's with lock protection. Check this pseudo code


void matter_helperfcn_stop_commissioning_window(void)
{
    // Close the commission window
    lock::status_t lock_status = lock::chip_stack_lock(portMAX_DELAY);

    CommissioningWindowManager & commissionMgr = Server::GetInstance().GetCommissioningWindowManager();
    commissionMgr.CloseCommissioningWindow();

    if (lock::SUCCESS == lock_status)
    {
        lock::chip_stack_unlock();
    }    
}

Even without calling CHIP SDK APIs also, I am seeing random device hangs with the latest ESP-matter SDK and watch dog timer resets the system

Here is the crash and back trace


D (30434) chip[DMG]: Command handler moving to [CommandSen]
D (30434) chip[DMG]: Command handler moving to [AwaitingDe]
I (30514) chip[DL]: Confirm received for CHIPoBLE TX characteristic indication (con 1) status= 14 
I (30514) CHIP[DL]: Write request received for CHIPoBLE RX characteristic con 1 12
I (30524) chip[EM]: >>> [E:13530r S:61336 M:83290419] (S) Msg RX from 1:FFFFFFFB00000000 [9B28] --- Type 0001:08 (IM:InvokeCommandRequest)
D (30534) chip[EM]: Handling via exchange: 13530r, Delegate: 0x3fc9560c
D (30534) chip[DMG]: InvokeRequestMessage =
D (30554) chip[DMG]: {
D (30554) chip[DMG]: suppressResponse = false, 
D (30554) chip[DMG]: timedRequest = false, 
D (30564) chip[DMG]: InvokeRequests =
D (30564) chip[DMG]: [
D (30564) chip[DMG]: CommandDataIB =
D (30564) chip[DMG]: {
D (30574) chip[DMG]: CommandPathIB =
D (30574) chip[DMG]: {
D (30574) chip[DMG]: EndpointId = 0x0,
D (30584) chip[DMG]: ClusterId = 0x30,
D (30584) chip[DMG]: CommandId = 0x0,
D (30584) chip[DMG]: },
D (30584) chip[DMG]: 
D (30594) chip[DMG]: CommandFields = 
D (30594) chip[DMG]: {
D (30604) chip[DMG]: 0x0 = 66, 
D (30604) chip[DMG]: 0x1 = 22, 
D (30614) chip[DMG]: },
D (30614) chip[DMG]: },
D (30614) chip[DMG]: 
D (30614) chip[DMG]: ],
D (30624) chip[DMG]: 
D (30624) chip[DMG]: InteractionModelRevision = 1
D (30624) chip[DMG]: },
D (30624) chip[DMG]: Received command for Endpoint=0 Cluster=0x0000_0030 Command=0x0000_0000
I (30634) esp_matter_command: Received command 0x00000000 for endpoint 0x0000's cluster 0x00000030
I (30644) chip[FS]: GeneralCommissioning: Received ArmFailSafe (66s)
I (30654) esp_matter_attribute: ********** W : Endpoint 0x0000's Cluster 0x00000030's Attribute 0x00000000 is 22 **********
W (30674) MATT_APP: attribute_update_cb: 0, 48, 0
D (30674) chip[DMG]: Command handler moving to [ Preparing]
D (30674) chip[DMG]: Command handler moving to [AddingComm]
D (30684) chip[DMG]: Command handler moving to [AddedComma]
D (30684) chip[DMG]: Decreasing reference count for CommandHandler, remaining 0
I (30694) chip[EM]: <<< [E:13530r S:61336 M:57366872] (S) Msg TX to 1:FFFFFFFB00000000 [9B28] --- Type 0001:09 (IM:InvokeCommandResponse)
I (30714) chip[IN]: (S) Sending msg 57366872 on secure session with LSID: 61336
D (30724) CHIP[DL]: Sending indication for CHIPoBLE TX characteristic (con 1, len 71)
I (30724) NimBLE: GATT procedure initiated: indicate; 
I (30734) NimBLE: att_handle=14

D (30734) chip[DMG]: Command handler moving to [CommandSen]
D (30734) chip[DMG]: Command handler moving to [AwaitingDe]
I (30784) chip[DL]: Confirm received for CHIPoBLE TX characteristic indication (con 1) status= 14 
I (30784) CHIP[DL]: Write request received for CHIPoBLE RX characteristic con 1 12
I (30794) chip[EM]: >>> [E:13531r S:61336 M:83290420] (S) Msg RX from 1:FFFFFFFB00000000 [9B28] --- Type 0001:08 (IM:InvokeCommandRequest)
D (30804) chip[EM]: Handling via exchange: 13531r, Delegate: 0x3fc9560c
D (30804) chip[DMG]: InvokeRequestMessage =
D (30824) chip[DMG]: {
D (30824) chip[DMG]: suppressResponse = false, 
D (30824) chip[DMG]: timedRequest = false, 
D (30834) chip[DMG]: InvokeRequests =
D (30834) chip[DMG]: [
D (30834) chip[DMG]: CommandDataIB =
D (30834) chip[DMG]: {
D (30844) chip[DMG]: CommandPathIB =
D (30844) chip[DMG]: {
D (30844) chip[DMG]: EndpointId = 0x0,
E (30854) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (30854) task_wdt:  - IDLE (CPU 0)
E (30854) task_wdt: Tasks currently running:
E (30854) task_wdt: CPU 0: tiT
E (30854) task_wdt: Aborting.

Core 2 register dump:
MEPC    : 0x4038bf5a  RA      : 0x4038b386  SP      : 0x3fcb1bd0  GP      : 0x3fc92000  
TP      : 0x3fc6d704  T0      : 0x4038b376  T1      : 0xfffffff0  T2      : 0x00000060  
S0/FP   : 0x00000044  S1      : 0x00000000  A0      : 0x3fcb0df8  A1      : 0x3fcb1cc8  
A2      : 0x00000007  A3      : 0x00000007  A4      : 0x3fcb1cc8  A5      : 0x3fcb1cc8  
A6      : 0x00000001  A7      : 0x00000000  S2      : 0x3fcb1c5c  S3      : 0xffffffff  
S4      : 0x00000000  S5      : 0x00000003  S6      : 0x00000004  S7      : 0x00000000  
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000  
T3      : 0x01000000  T4      : 0x00000000  T5      : 0x00000000  T6      : 0x00000000  
MSTATUS : 0x3fcb1c5c  MTVEC   : 0x00000000  MCAUSE  : 0x3fcb0dd4  MTVAL   : 0x4038a854  
MHARTID : 0x420800bc  

Stack memory:
3fcb1bd0: 0x3fcb1c5c 0x00000000 0x3fcb0dd4 0x4038a854 0x420800bc 0x3c163984 0x3fcbcb00 0x00000044
3fcb1bf0: 0x420800bc 0x3c163984 0x00000000 0x00000bb8 0x00000004 0x00000003 0x00000000 0x00000002
3fcb1c10: 0x3fca4000 0xffffffff 0x3fcbcb00 0x4208865a 0x3fca4000 0x420800bc 0x000073f0 0x3fcb1c5c
3fcb1c30: 0x3fca4000 0xffffffff 0x3fca4000 0x42080296 0x3fca4000 0xffffffff 0x3fcbcb00 0x42078f0e
3fcb1c50: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1c70: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1c90: 0x00000000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0x00000154
3fcb1cb0: 0x3fcb1b50 0x00000bfc 0x3fc94830 0x3fc94830 0x3fcb1cb0 0x3fc94828 0x00000007 0x3fcb1cc8
3fcb1cd0: 0x3fcb1cc8 0x3fcb1cb0 0x3fcb0df8 0x00000012 0x3fcb0eac 0x00546974 0xa5a5a5a5 0xa5a5a5a5
3fcb1cf0: 0x0015bfec 0x00000000 0x3fcb1ca0 0x00000012 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1d10: 0x3fca62dc 0x3fca6344 0x3fca63ac 0x00000000 0x00000000 0x00000001 0x00000000 0x00000000
3fcb1d30: 0x00000000 0x42124e88 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1d50: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1d70: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1d90: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1db0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1dd0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1df0: 0x00000000 0x00000000 0x00000000 0x00000000 0x3f000000 0x00000054 0x3fcb1e08 0x3fcb1e08
3fcb1e10: 0x3fcb1e08 0x3fcb1e08 0x00000000 0x3fcb1e20 0xffffffff 0x3fcb1e20 0x3fcb1e20 0x00000000
3fcb1e30: 0x3fcb1e34 0xffffffff 0x3fcb1e34 0x3fcb1e34 0x00000001 0x00000001 0x00000000 0x0000ffff
3fcb1e50: 0x00000000 0xb33fffff 0x00000000 0x00000060 0x153bdaec 0x0000e806 0x3fcb1ec4 0x3fcb1ed4
3fcb1e70: 0x3fcb1ee4 0x420d7c52 0x420d7ba8 0x3fcb1ee4 0x00000000 0x00000000 0x3fcb2070 0x4208e506
3fcb1e90: 0x4208e4fc 0x4208e4f2 0x00000000 0x00000000 0x00000000 0x00000000 0x00000001 0x00000099
3fcb1eb0: 0x00000000 0x3fcb204c 0x3fcb2060 0x00000064 0x0000000c 0x00000000 0x00000000 0x00000000
3fcb1ed0: 0x0000000c 0x00000000 0x00000000 0x00000000 0x000000fc 0x3fc9e300 0x00000000 0x00000000
3fcb1ef0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1f10: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1f30: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000006 0x00000000 0x00000000
3fcb1f50: 0x00000000 0x00000000 0x00000000 0x00000006 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1f70: 0x00000000 0x00000006 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcb1f90: 0x00000000 0x42078fa4 0x42082d82 0x420d7b28 0x420e51e4 0x3fcb1e60 0x3fccc678 0x3fcbcb14
3fcb1fb0: 0x00000000 0x3c172694 0x05dc05dc 0x153bdaec 0x6a06e806 0x01017473 0x00000003 0x00000000

BackTrace:


riscv32-esp-elf-addr2line -pfiaC -e build/ac.elf 0x4038bf5a
0x4038bf5a: vListInsert at li_esp32matter_sdk/esp-idf/components/freertos/FreeRTOS-Kernel/list.c:171 (discriminator 1)
PSONALl commented 1 year ago

@aggaddam can you share a patch with your changes? I am not able to reproduce this error

aggaddam commented 1 year ago

@PSONALl This issue is more frequently observed when log level is set to debug instead of info. Please try with the log level set to debug and then try matter commissioning for at least 5-10 times.

This issue is also observed with log level info but the occurrence is less compared to debug log level

PSONALl commented 1 year ago

@aggaddam I tried to reproduce this issue by setting log level to debug and tried commissioning more than 10 times but I am not able to reproduce the error Please check if your changes are correct once

aggaddam commented 1 year ago

@PSONALl Somehow, i am unable to reproduce the issue with examples code. I am trying to reproduce the issue by adding some application changes to the example code. Will share the patch If i am able to reproduce the issue

dhrishi commented 11 months ago

I will close this now. Please re-open with the requested details if you encounter the issue