espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
12.93k stars 7.09k forks source link

[Debug Feature] print actual backtrace when Idle WDT fires (IDFGH-10181) #11449

Open chipweinberger opened 1 year ago

chipweinberger commented 1 year ago

Is your feature request related to a problem?

ESP-IDF: release/v4.4

Debugging Idle TWDT crashes is harder than it needs to be.

In the example below, we should print the backtrace of btApiTask, but instead we get the backtrace of the idle task (much less useful).

I think we could do this with a task snapshot. See: https://github.com/espressif/esp-idf/issues/9708

I (294133) pd midi reader seek: seeking to begining: 0x16 (firstMsgOffset)
I (294143) pd midi reader: open success.
I (294173) pd xtrans: header: Content-Length: 1914
E (354233) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (354233) task_wdt:  - IDLE (CPU 1)
E (354233) task_wdt: Tasks currently running:
E (354233) task_wdt: CPU 0: IDLE
E (354233) task_wdt: CPU 1: btApiTask
E (354233) task_wdt: Aborting.

abort() was called at PC 0x421b7608 on core 0
0x421b7608: task_wdt_isr at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_system/task_wdt.c:176 (discriminator 3)

Backtrace: 0x403782cb:0x3fcbb8a0 0x4038e74d:0x3fcbb8d0 0x403a65a7:0x3fcbb900 0x421b7608:0x3fcbb980 0x4037dc31:0x3fcbb9c0 0x4218405a:0x3fcf61f0 0x42007b1e:0x3fcf6220 0x403944de:0x3fcf6250
0x403782cb: panic_abort at /Volumes/User/MBP-Google-Drive/jamcorder/firmware/esp-idf/components/esp_system/panic.c:429

0x4038e74d: esp_system_abort at /esp-idf/components/esp_system/esp_system.c:137

0x403a65a7: abort at /esp-idf/components/newlib/abort.c:46

0x421b7608: task_wdt_isr at /esp-idf/components/esp_system/task_wdt.c:176 (discriminator 3)

0x4037dc31: _xt_lowint1 at /esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1114

0x4218405a: cpu_ll_waiti at /esp-idf/components/hal/esp32s3/include/hal/cpu_ll.h:185
 (inlined by) esp_pm_impl_waiti at /esp-idf/components/esp_pm/pm_impl.c:830

0x42007b1e: esp_vApplicationIdleHook at /esp-idf/components/esp_system/freertos_hooks.c:63

0x403944de: prvIdleTask at /esp-idf/components/freertos/tasks.c:3997 (discriminator 1)
chipweinberger commented 1 year ago

I see there have been a lot of changes to void task_wdt_isr(void *arg) in v5.0, so perhaps this work is already done?

see: https://github.com/espressif/esp-idf/commit/e25cda2c400b28362436b3ae01b969386892ab8e

chipweinberger commented 1 year ago

For anyone else on v4.4, a simple workaround is to disable CONFIG_ESP_TASK_WDT_PANIC in KCondig.

This will get you a stacktrace if you really need one. You could also look at the coredump.

dizcza commented 1 year ago

I still have the same problem with v5.1

===============================================================
==================== ESP32 CORE DUMP START ====================

Crashed task handle: 0x3ffc3ce4, name: 'IDLE', GDB name: 'process 1073495268'

================== CURRENT THREAD REGISTERS ===================
exccause       0x400833ec (Invalid EXCCAUSE code)
excvaddr       0x4008f3b6
epc1           0x400e2dbd
epc2           0x0
epc3           0x0
epc4           0x0
epc5           0x4008f2ba
epc6           0x0
eps2           0x0
eps3           0x0
eps4           0x0
eps5           0x60023
eps6           0x0
pc             0x4000bff0          0x4000bff0
lbeg           0x0                 0
lend           0x0                 0
lcount         0x0                 0
sar            0x1d                29
ps             0x60020             393248
threadptr      <unavailable>
br             <unavailable>
scompare1      <unavailable>
acclo          <unavailable>
acchi          <unavailable>
m0             <unavailable>
m1             <unavailable>
m2             <unavailable>
m3             <unavailable>
expstate       <unavailable>
f64r_lo        <unavailable>
f64r_hi        <unavailable>
f64s           <unavailable>
fcr            <unavailable>
fsr            <unavailable>
a0             0x80094dcb          -2146873909
a1             0x3ffc3b80          1073494912
a2             0x0                 0
a3             0x60023             393251
a4             0x60020             393248
a5             0x3f403828          1061173288
a6             0xb33fffff          -1287651329
a7             0xb33fffff          -1287651329
a8             0x0                 0
a9             0x1                 1
a10            0x7b096c            8063340
a11            0x3ffb096c          1073416556
a12            0x60423             394275
a13            0x3f403828          1061173288
a14            0xb33fffff          -1287651329
a15            0xb33fffff          -1287651329

==================== CURRENT THREAD STACK =====================
#0  0x4000bff0 in ?? ()
#1  0x40094dcb in vApplicationGetIdleTaskMemory (ppxIdleTaskTCBBuffer=0x3ffb5fac <port_uxOldInterruptState+4>, pp
xIdleTaskStackBuffer=0x3ffc3bc0, pulIdleTaskStackSize=0x4) at /home/dizcza/tools/esp-idf/components/freertos/port
_common.c:52
#2  0x400920b4 in find_entry_from_task_handle_and_check_all_reset (handle=0x0, all_reset=0x60023) at /home/dizcza
/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:146
#3  0x4009213a in prvCopyDataToQueue (pxQueue=0x40092134 <prvCopyDataToQueue>, pvItemToQueue=0x60020, xPosition=4
) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/queue.c:2356
#4  0x40092019 in esp_task_wdt_reset () at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:7
83
#5  0x40093093 in prvCheckTasksWaitingTermination () at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-K
ernel/tasks.c:4567
#6  prvIdleTask (pvParameters=<optimized out>) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/
tasks.c:4231
#7  0x40094b4d in uxInitialiseStackTLS (ret_threadptr_reg_init=<optimized out>, uxStackPointer=<optimized out>) a
t /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:263
#8  pxPortInitialiseStack (pxTopOfStack=<optimized out>, pxCode=0x0, pvParameters=0x0) at /home/dizcza/tools/esp-
idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:450

======================== THREADS INFO =========================
  Id   Target Id          Frame 
* 1    process 1073495268 0x4000bff0 in ?? ()
  2    process 1073491500 0x4016c7ef in _strtod_l (ptr=0x0, s00=<optimized out>, se=0x3ffc2ab0, loc=<optimized ou
t>) at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/strtod.c:811
  3    process 1073493384 0x4000bff0 in ?? ()
  4    process 1073544832 0x4000bff0 in ?? ()
  5    process 1073521872 0x40087f84 in ipc_task (arg=<optimized out>) at /home/dizcza/tools/esp-idf/components/e
sp_system/esp_ipc.c:76
  6    process 1073411116 0x4000bff0 in ?? ()
  7    process 1073412488 0x4000bff0 in ?? ()
  8    process 1073548800 0x4000bff0 in ?? ()
  9    process 1073537224 0x4000bff0 in ?? ()
  10   process 1073527212 0x4000bff0 in ?? ()
  11   process 1073412832 0x4000bff0 in ?? ()
  12   process 1073558820 0x4000bff0 in ?? ()

==================== THREAD 1 (TCB: 0x3ffc3ce4, name: 'IDLE') =====================
#0  0x4000bff0 in ?? ()
#1  0x40094dcb in vApplicationGetIdleTaskMemory (ppxIdleTaskTCBBuffer=0x3ffb5fac <port_uxOldInterruptState+4>, pp
xIdleTaskStackBuffer=0x3ffc3bc0, pulIdleTaskStackSize=0x4) at /home/dizcza/tools/esp-idf/components/freertos/port
_common.c:52
#2  0x400920b4 in find_entry_from_task_handle_and_check_all_reset (handle=0x0, all_reset=0x60023) at /home/dizcza
/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:146
#3  0x4009213a in prvCopyDataToQueue (pxQueue=0x40092134 <prvCopyDataToQueue>, pvItemToQueue=0x60020, xPosition=4
) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/queue.c:2356
#4  0x40092019 in esp_task_wdt_reset () at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:7
83
#5  0x40093093 in prvCheckTasksWaitingTermination () at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-K
ernel/tasks.c:4567
#6  prvIdleTask (pvParameters=<optimized out>) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/
tasks.c:4231
#7  0x40094b4d in uxInitialiseStackTLS (ret_threadptr_reg_init=<optimized out>, uxStackPointer=<optimized out>) a
t /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:263
#8  pxPortInitialiseStack (pxTopOfStack=<optimized out>, pxCode=0x0, pvParameters=0x0) at /home/dizcza/tools/esp-
idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:450

The backtrace is not informative

dizcza commented 1 year ago

I expected to see that the sdprecord_read_sensor_task WDT is triggered at line ulTaskNotifyTake(pdTRUE, portMAX_DELAY); Or at least the name of the task. Not IDLE.

In the coredump I showed, there is no such information whatsoever.

chipweinberger commented 1 year ago

Interesting. looking at the code I thought there were improvements in v5.1.

I've reopened. That's too bad.

igrr commented 1 year ago

@dizcza could you please check the ELF file SHA256 printed by your app at startup, and compare it to the SHA256 of the ELF file passed to espcoredump.py?

Unreasonable backtraces like the one you have posted (e.g. prvCheckTasksWaitingTermination most definitely doesn't call esp_task_wdt_reset) are most often the result of a mismatch between the ELF file used to produce the application binary and the ELF file used when decoding the core dump.

dizcza commented 1 year ago

@dizcza could you please check the ELF file SHA256 printed by your app at startup, and compare it to the SHA256 of the ELF file passed to espcoredump.py?

Unreasonable backtraces like the one you have posted (e.g. prvCheckTasksWaitingTermination most definitely doesn't call esp_task_wdt_reset) are most often the result of a mismatch between the ELF file used to produce the application binary and the ELF file used when decoding the core dump.

Indeed, the SHA differs:

I (1458) cpu_start: ELF file SHA256:  99af209085100ddf...
I (08:50:19.357) coredump: App SHA:
I (08:50:19.360) coredump: 33 65 33 62 64 65 36 30 66 36 30 64 32 30 39 65 

The coredump app SHA is taken from the app_elf_sha256 of the coredump summary. After I erased the flash entirely and did nothing to the app I got yet another coredump SHA:

I (00:00:01.116) coredump: App SHA:
I (00:00:01.120) coredump: 39 39 61 66 32 30 39 30 38 35 31 30 30 64 64 66

In both cases, the coredump info is like I've shown above.

My coredump procedure is the following. First, I set up the partitions like so:

nvs,      data, nvs,     ,        0x4000,
otadata,  data, ota,     ,        0x2000,
phy_init, data, phy,     ,        0x1000,
factory,  app,  factory, ,        0x127000,
ota_0,    app,  ota_0,   ,        0x127000,
ota_1,    app,  ota_1,   ,        0x127000,
coredump, data, coredump,,        64K,

Then I configured the app to dump backtraces to flash. Upon reset, I read the coredump partition and copy raw bytes to an SD card mounted to my ESP32 board. Finally, I download this binary file and run

idf.py coredump-info -c COREDUMP.BIN -p /dev/ttyACM0

The code to read a coredump and write it to an SD card:

void coredump_log_sdcard() {
    coredump_print_summary();

    size_t out_cd_addr = 0;
    size_t out_cd_size = 0;

    esp_err_t err = esp_core_dump_image_get(&out_cd_addr, &out_cd_size);
    bool erase = false;
    if (err != ESP_OK) {
        ESP_LOGI(TAG, "esp_core_dump_get_summary: %s", esp_err_to_name(err));
        erase = true;
    } else {
        ESP_LOGI(TAG, "core dump addr: %d", out_cd_addr);
        ESP_LOGI(TAG, "core dump size: %d", out_cd_size);
    }

    const char *label = COREDUMP_PARTITION_NAME;
    const esp_partition_t* partition = esp_partition_find_first(
            ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, label);
    if (partition == NULL) {
        ESP_LOGI(TAG, "partition '%s' not found", label);
        return;
    }

    if (out_cd_size <= 0) {
        ESP_LOGW(TAG, "partition '%s' size: %d", label, out_cd_size);
        return;
    }

    if (erase) {
        // erase and exit
        esp_partition_erase_range(partition, 0, out_cd_size);
        return;
    }

    FILE* f = sdcard_record_open_file_once("COREDUMP.BIN", "w");
    if (f == NULL) {
        return;
    }

    char *buffer = (char*) malloc(COREDUMP_BUFF_SIZE);
    long offset = 0;
    while (out_cd_size > 0) {
        int chunk_size = out_cd_size > COREDUMP_BUFF_SIZE ? COREDUMP_BUFF_SIZE : out_cd_size;
        esp_partition_read(partition, offset, buffer, chunk_size);
        fwrite(buffer, sizeof(char), chunk_size, f);
        offset += chunk_size;
        out_cd_size -= chunk_size;
    }

    fclose(f);
    free(buffer);
}

ESP-IDF v5.1-dev-4528-g420ebd208a


Here is the output in the terminal:

E (32754) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (32754) task_wdt:  - sdp_read (CPU 1)
E (32754) task_wdt: Tasks currently running:
E (32754) task_wdt: CPU 0: IDLE
E (32754) task_wdt: CPU 1: IDLE
E (32754) task_wdt: Aborting.
E (32754) task_wdt: Print CPU 1 backtrace

Backtrace: 0x4000bfed:0x3ffc3b60 0x4009402c:0x3ffc3b70 0x4009146d:0x3ffc3b90 0x400914f3:0x3ffc3bd0 0x400913d2:0x3ffc3bf0 0x4009244c:0x3ffc3c10 0x40093dae:0x3ffc3c30
0x4009402c: vPortClearInterruptMaskFromISR at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:568
 (inlined by) vPortExitCritical at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:532

0x4009146d: esp_task_wdt_reset at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:791

0x400914f3: idle_hook_cb at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:490

0x400913d2: esp_vApplicationIdleHook at /home/dizcza/tools/esp-idf/components/esp_system/freertos_hooks.c:47 (discriminator 1)

0x4009244c: prvIdleTask at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:4280

0x40093dae: vPortTaskWrapper at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162

ELF file SHA256: 99af209085100ddf

I (5018) esp_core_dump_flash: Save core dump to flash...
I (5024) esp_core_dump_common: Backing up stack @ 0x3ffb27f0 and use core dump stack @ 0x3ffb9930
I (5034) esp_core_dump_flash: Erase flash 20480 bytes @ 0x397000
I (5197) esp_core_dump_flash: Write end offset 0x4d84, check sum length 4
I (5197) esp_core_dump_common: Core dump used 944 bytes on stack. 644 bytes left free.
I (5201) esp_core_dump_common: Restoring stack @ 0x3ffb27f0
I (5208) esp_core_dump_flash: Core dump has been saved to flash.
Rebooting...
igrr commented 1 year ago

Could you please give it a try to read the core dump directly from flash (with idf.py coredump-info), bypassing the step with the SD card?

dizcza commented 1 year ago

Could you please give it a try to read the core dump directly from flash (with idf.py coredump-info), bypassing the step with the SD card?

Same story.

igrr commented 1 year ago

Could you try to make a version of the program which you can share, that still reproduces the issue?

I just ran the test application — https://github.com/espressif/esp-idf/tree/master/tools/test_apps/system/panic — with the config from sdkconfig.ci.coredump_flash_elf_sha, and ran the task_wdt test in that application. The backtrace was decoded correctly.

dizcza commented 1 year ago

Could you try to make a version of the program which you can share, that still reproduces the issue?

I just ran the test application — https://github.com/espressif/esp-idf/tree/master/tools/test_apps/system/panic — with the config from sdkconfig.ci.coredump_flash_elf_sha, and ran the task_wdt test in that application. The backtrace was decoded correctly.

Change the infinite_loop function to

static void infinite_loop(void* arg) {
    (void) arg;

    esp_task_wdt_add(NULL);

    ulTaskNotifyTake(pdTRUE, portMAX_DELAY);

    while(1) {
        ;
    }
}

run the test_task_wdt_cpu1 test and you'll see similar output as I've shown above. Here is what I get:

Crashed task handle: 0x3ffafd60, name: 'IDLE', GDB name: 'process 1073413472'

==================== CURRENT THREAD STACK =====================
#0  0x4000bff0 in ?? ()
#1  0x4008b0ee in vPortClearInterruptMaskFromISR (prev_level=<optimized out>) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:568
#2  vPortExitCritical (mux=<optimized out>) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:532
#3  0x400dd33a in esp_task_wdt_reset () at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:791
#4  0x400dd342 in idle_hook_cb () at /home/dizcza/tools/esp-idf/components/esp_system/task_wdt/task_wdt.c:490
#5  0x400d4c47 in esp_vApplicationIdleHook () at /home/dizcza/tools/esp-idf/components/esp_system/freertos_hooks.c:47
#6  0x400ebb7d in prvIdleTask (pvParameters=0x0) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:4280
#7  0x4008aaf0 in vPortTaskWrapper (pxCode=0x400ebb74 <prvIdleTask>, pvParameters=0x0) at /home/dizcza/tools/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162

Or this is the expected output and really no way to say which task hasn't yielded to the WDT? Cause I have lots of tasks each guarded by a WDT...

igrr commented 1 year ago

The last backtrace you have posted looks reasonable if you are registering TWDT for another task: the Idle task was running (idling) and at some point got interrupted by TWDT interrupt. In this case it's indeed not possible fo say just from the backtrace, which of the tasks has not fed the watchdog.

dizcza commented 1 year ago

The last backtrace you have posted looks reasonable if you are registering TWDT for another task: the Idle task was running (idling) and at some point got interrupted by TWDT interrupt. In this case it's indeed not possible fo say just from the backtrace, which of the tasks has not fed the watchdog.

Sadly indeed.

dizcza commented 1 year ago

Well, that was the reason I posted a response here thinking that cases as I described should be more informative.

If it's a won't fix, let it be.

Or let it mark a feature request - don't know if implementing it is possible.

igrr commented 1 year ago

We have just discussed this with the team, it is probably possible to include additional information into the core dump to make such cases easier to analyze. Basically, this part of the text currently printed to console

Task watchdog got triggered. The following tasks did not reset the watchdog in time:
  - IDLE (CPU 1)
 Tasks currently running:
 CPU 0: IDLE
 CPU 1: btApiTask
 Aborting.

could also be included into the coredump, and extracted by the coredump utility on the host side. In your case, the "following tasks did not reset the watchdog in time" list would include the task which you have registered the task watchdog for.

We probably won't implement this immediately, but let's keep this as an open feature request.

dizcza commented 1 year ago

the "following tasks did not reset the watchdog in time" list would include the task which you have registered the task watchdog for.

Nice idea, I like it.

All right.