espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.6k stars 7.27k forks source link

Assert failed: "uxSchedulerSuspended[xPortGetCoreID()] == 0" in function vTaskDelay (IDFGH-6041) #7726

Closed MikitaMinau closed 2 years ago

MikitaMinau commented 3 years ago

Environment

Problem Description

The assert in vTaskDelay() failed in the middle of OTA FW image downloading process. My code does not call vTaskSuspendAll() directly, but calls vTaskDelay() quite often in different tasks with different priorities. The tasks are mostly unpinned. However, OTA task are pinned to core 0. We have been developing with v4.0 for more than a year but faces such issue only after ESP-IDF update to 4.3. Also, I have found similar issue here https://github.com/espressif/esp-idf/issues/4230.

Expected Behavior

OTA should not lead to an assert.

Actual Behavior

Assert fails during OTA FW update.

Steps to reproduce

Was reproduced only once during OTA process. 20-30 tries were made before the issue was encountered.

Code to reproduce this issue

Unfortunately, I cannot share the code (also, it requires custom HW and won't run on devkits). OTA task looks pretty much like the task in OTA esp-idf\examples\system\ota\native_ota_example.

Debug Logs

(27638) Ota.c: WiFi connected
(27648) Ota.c: Starting OTA
(27658) Ota.c: Running partition type 0 subtype 16 (offset 0x00010000)
(27668) Ota.c: OTA URL is http://pupdates.wagz.com/collar/collar-23/develop/wagzfwos.bin
(28188) Ota.c: Writing to partition subtype 17 at offset 0x330000
(28208) Ota.c: New firmware version: 1.10.0
(28208) Ota.c: Running firmware version: 1.7.0
(38028) Ota.c: esp_ota_begin succeeded
(38038) Ota.c: Written image length 2048
(38058) Ota.c: Written image length 4096
(38078) Ota.c: Written image length 6144
(38328) Ota.c: Written image length 8192
............................................................
(55878) Ota.c: Written image length 1189888
(55918) Ota.c: Written image length 1191936
(55958) Ota.c: Written image length 1193984
assertion "uxSchedulerSuspended[xPortGetCoreID()] == 0" failed: file "C:/Projects/esp-idf/components/freertos/tasks.c", line 1501, function: vTaskDelay

abort() was called at PC 0x401d7e4c on core 1

Backtrace:0x4009769e:0x3ffefc80 0x40097ea9:0x3ffefca0 0x4009c4ba:0x3ffefcc0 0x401d7e4c:0x3ffefd30 0x401fddee:0x3ffefd60 0x400f1559:0x3ffefd80 0x400f11c2:0x3ffefda0 0x400f008e:0x3ffefdd0 0x400effcd:0x3ffefe70 0x400e50f5:0x3ffefea0 0x400e56d2:0x3ffeff60 0x400982d9:0x3ffeff90

ELF file SHA256: 41089e61bedb84d8

Rebooting...
(622) psram: This chip is ESP32-D0WD
(623) spiram: Found 64MBit SPI RAM device
(623) spiram: SPI RAM mode: flash 40m sram 40m

Other items if possible

sdkconfig.txt

shubhamkulkarni97 commented 3 years ago

Hi @MikitaMinau, Thanks for reporting the issue.

Could you provide more information so we can reproduce this issue?

AxelLin commented 3 years ago

Probably not 100% related, but just FYI: In v4.3 branch, below needs fix: https://github.com/espressif/esp-idf/blob/release/v4.3/components/freertos/tasks.c#L3917

PS. It was fixed in master by da65a010a32c946d25bbbc4dfb590ece00215c90.

shubhamkulkarni97 commented 3 years ago

@AxelLin Thanks for pointing this out. We will backport this fix and check if this issue is related to it.

shubhamkulkarni97 commented 3 years ago

@AxelLin There is definitely an issue with eTaskConfirmSleepModeStatus that you pointed out. However, this API is not used in IDF code. So the issue @MikitaMinau is facing must be different from the one you pointed out.

MikitaMinau commented 2 years ago

@shubhamkulkarni97 Hello Shubham, Thank you for your reply.

From the logs I can confirm that Thing Shadow update function was called during the OTA.

(54058) Ota.c: Written image length 1116160
(54078) ThingShadowService.c: Amazon AWS Thing Shadow connected!
(54088) Ota.c: Written image length 1118208
(54108) ThingShadowService.c: Updating the reported section
(54108) ThingShadowService.c: Update Shadow: {"state":{"reported"..................
(54118) Ota.c: Written image length 1120256
MikitaMinau commented 2 years ago

@shubhamkulkarni97 Have you had a chance to look at the issue?

shubhamkulkarni97 commented 2 years ago

@MikitaMinau I tried to reproduce this issue but with no success.

These kind of issues are mostly observed due to heap corruption. You can refer to Heap Memory Debugging guide for methods to find heap corruption.

Heap Tracing is also a good tool to find memory leaks.

shubhamkulkarni97 commented 2 years ago

@MikitaMinau Any updates on this issue?

MikitaMinau commented 2 years ago

@shubhamkulkarni97 Hello Shubham, I added heap integrity check in light impact mode. I have not found any issue with heap memory corruption yet and I do not think I have any. However, I think I know where the root cause for the assertion might be.

Here is the vTaskDelay code from v4.3

void vTaskDelay( const TickType_t xTicksToDelay )
    {
    BaseType_t xAlreadyYielded = pdFALSE;

        /* A delay time of zero just forces a reschedule. */
        if( xTicksToDelay > ( TickType_t ) 0U )
        {
            configASSERT( uxSchedulerSuspended[xPortGetCoreID()] == 0 );
            taskENTER_CRITICAL( &xTaskQueueMutex );

I guess the problem here is that assert is placed outside of critical section. Lest assume that unpinned task calls vTaskDelay() while on core 0 and gets preempted right after xPortGetCoreID() is called. Then, a higher priority task calls vTaskSuspendAll() on core 0. A lower priority task is unpinned and switches to core 1 but xPortGetCoreID() has already been called on core 0, so config assert calls exception because scheduler on core 0 has been suspended.

Here is the code from v4.0

void vTaskDelay( const TickType_t xTicksToDelay )
    {
    TickType_t xTimeToWake;
    BaseType_t xAlreadyYielded = pdFALSE;

        /* A delay time of zero just forces a reschedule. */
        if( xTicksToDelay > ( TickType_t ) 0U )
        {
            configASSERT( xTaskGetSchedulerState() != taskSCHEDULER_SUSPENDED );
            taskENTER_CRITICAL(&xTaskQueueMutex);

The function xTaskGetSchedulerState() has critical section inside, that is why I have never seen such issue on v4.0. xTaskGetSchedulerState() was replaced with xPortGetCoreID() in this commit. What do you think?

Also, I have found very similar bug with vTaskDelayUntil() on v3.3. You can find it here.

mahavirj commented 2 years ago

@MikitaMinau Thanks for analyzing this further. We will take a look and create applicable fixes for v4.3 and onward. CC @Dazza0

MikitaMinau commented 2 years ago

@mahavirj Thank you for fixing the issue. When do you think the fix could go into a release version?

mahavirj commented 2 years ago

@MikitaMinau

Backport MRs had been created and they passed internal tests as well. Fix on release/v4.4 should appear with next GH sync. However for release/v4.3 it may take some time, branch is currently locked for next patch release purpose.

You may used attached freertos: fix thread safety for checking scheduler state_v4.3.zip patch for release/v4.3 interim. Kindly let us know if you face any issue with this fix.