Closed MikitaMinau closed 2 years ago
Hi @MikitaMinau, Thanks for reporting the issue.
Could you provide more information so we can reproduce this issue?
xtensa-esp32-elf-addr2line -piaf -e <elf-file> <backtrace>
Probably not 100% related, but just FYI: in the v4.3 branch, the following needs a fix: https://github.com/espressif/esp-idf/blob/release/v4.3/components/freertos/tasks.c#L3917
PS. It was fixed in master by da65a010a32c946d25bbbc4dfb590ece00215c90.
@AxelLin Thanks for pointing this out. We will backport this fix and check if this issue is related to it.
@AxelLin There is definitely an issue with eTaskConfirmSleepModeStatus that you pointed out. However, this API is not used in IDF code. So the issue @MikitaMinau is facing must be different from the one you pointed out.
@shubhamkulkarni97 Hello Shubham, Thank you for your reply.
0x4009769e: panic_abort at C:/Projects/esp-idf/components/esp_system/panic.c:356
0x40097ea9: esp_system_abort at C:/Projects/esp-idf/components/esp_system/system_api.c:112
0x4009c4ba: abort at C:/Projects/esp-idf/components/newlib/abort.c:46
0x401d7e4c: __assert_func at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdlib/assert.c:62 (discriminator 8)
0x401fddee: vTaskDelay at C:/Projects/esp-idf/components/freertos/tasks.c:1501 (discriminator 1)
0x400f1559: has_timer_expired at c:\projects\wagz-fwos-2\build/../lib/AmazonAwsPort/port/Timer.c:46
0x400f11c2: subscribeToShadowActionAcks at c:\projects\wagz-fwos-2\build/../lib/AmazonAwsPort/aws-sdk/src/aws_iot_shadow_records.c:370 (discriminator 1)
0x400f008e: aws_iot_shadow_internal_action at c:\projects\wagz-fwos-2\build/../lib/AmazonAwsPort/aws-sdk/src/aws_iot_shadow_actions.c:56
0x400effcd: aws_iot_shadow_update at c:\projects\wagz-fwos-2\build/../lib/AmazonAwsPort/aws-sdk/src/aws_iot_shadow.c:187
0x400e50f5: updateShadow at c:\projects\wagz-fwos-2\build/../src/ThingShadowService/ThingShadowService.c:948 (discriminator 13)
0x400e56d2: thingShadowTask at c:\projects\wagz-fwos-2\build/../src/ThingShadowService/ThingShadowService.c:815 (discriminator 15)
0x400982d9: vPortTaskWrapper at C:/Projects/esp-idf/components/freertos/port/xtensa/port.c:168
According to the backtrace, the assert was called from thingShadowTask(). This task is unpinned and has priority 3, while the OTA task is pinned to core 0 and has priority 5. The task calls aws_iot_shadow_update() from the AWS IoT Device Embedded SDK. The code was taken from https://github.com/espressif/esp-aws-iot.
From the logs I can confirm that the Thing Shadow update function was called during the OTA.
(54058) Ota.c: Written image length 1116160
(54078) ThingShadowService.c: Amazon AWS Thing Shadow connected!
(54088) Ota.c: Written image length 1118208
(54108) ThingShadowService.c: Updating the reported section
(54108) ThingShadowService.c: Update Shadow: {"state":{"reported"..................
(54118) Ota.c: Written image length 1120256
@shubhamkulkarni97 Have you had a chance to look at the issue?
@MikitaMinau I tried to reproduce this issue but with no success.
These kinds of issues are mostly caused by heap corruption. You can refer to the Heap Memory Debugging guide for methods to find heap corruption.
Heap Tracing is also a good tool for finding memory leaks.
@MikitaMinau Any updates on this issue?
@shubhamkulkarni97 Hello Shubham, I enabled the heap integrity check in light impact mode. I have not found any heap memory corruption so far, and I do not think there is any. However, I think I know what the root cause of the assertion might be.
Here is the vTaskDelay code from v4.3:
void vTaskDelay( const TickType_t xTicksToDelay )
{
    BaseType_t xAlreadyYielded = pdFALSE;
    /* A delay time of zero just forces a reschedule. */
    if( xTicksToDelay > ( TickType_t ) 0U )
    {
        configASSERT( uxSchedulerSuspended[xPortGetCoreID()] == 0 );
        taskENTER_CRITICAL( &xTaskQueueMutex );
I guess the problem here is that the assert is placed outside the critical section. Let's assume that an unpinned task calls vTaskDelay() while running on core 0 and gets preempted right after xPortGetCoreID() returns. Then a higher-priority task calls vTaskSuspendAll() on core 0. The lower-priority task is unpinned, so it resumes on core 1, but it still holds the core ID 0 that it read before being preempted. It therefore checks uxSchedulerSuspended[0], which is now nonzero, and configASSERT fails even though the scheduler on core 1 is not suspended.
Here is the code from v4.0:
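The interleaving described above can be replayed deterministically in plain C. This is only a toy model, not ESP-IDF code: every sim_* name is hypothetical, sim_scheduler_suspended[] stands in for uxSchedulerSuspended[], and the "preemption" is an ordinary function call placed between reading the core ID and indexing the array.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NUM_CORES 2

/* Hypothetical stand-in for the per-core uxSchedulerSuspended[] counters. */
static unsigned sim_scheduler_suspended[SIM_NUM_CORES];

/* Core on which the simulated unpinned task is currently running. */
static int sim_current_core;

/* Models the preemption: a higher-priority task suspends the scheduler on
 * core 0, and the unpinned task is moved to core 1. */
static void sim_preempt_and_migrate(void)
{
    sim_scheduler_suspended[0]++;
    sim_current_core = 1;
}

/* Replays the unguarded check. Returns true if the assert condition holds. */
bool sim_unguarded_check_passes(bool preempt_between)
{
    sim_scheduler_suspended[0] = 0;
    sim_scheduler_suspended[1] = 0;
    sim_current_core = 0;

    /* Step 1: read the core ID (the role xPortGetCoreID() plays). */
    int core_id = sim_current_core;
    assert(core_id >= 0 && core_id < SIM_NUM_CORES);

    /* Step 2: optional preemption between the two steps. */
    if (preempt_between) {
        sim_preempt_and_migrate();
    }

    /* Step 3: index the array with the (possibly stale) core ID. */
    return sim_scheduler_suspended[core_id] == 0;
}

/* True if the scheduler on the core the task is running on NOW is active. */
bool sim_current_core_scheduler_running(void)
{
    return sim_scheduler_suspended[sim_current_core] == 0;
}
```

With preempt_between set to true the check fails, although the scheduler of the core the task ends up on is still running, which matches the rarity of the failure: the preemption has to land in a window of a few instructions.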
void vTaskDelay( const TickType_t xTicksToDelay )
{
    TickType_t xTimeToWake;
    BaseType_t xAlreadyYielded = pdFALSE;
    /* A delay time of zero just forces a reschedule. */
    if( xTicksToDelay > ( TickType_t ) 0U )
    {
        configASSERT( xTaskGetSchedulerState() != taskSCHEDULER_SUSPENDED );
        taskENTER_CRITICAL(&xTaskQueueMutex);
The function xTaskGetSchedulerState() has a critical section inside, which is why I never saw this issue on v4.0. The xTaskGetSchedulerState() call was replaced with a direct uxSchedulerSuspended[xPortGetCoreID()] check in this commit. What do you think?
Also, I have found a very similar bug with vTaskDelayUntil() on v3.3. You can find it here.
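To illustrate why doing the lookup inside a critical section closes the window, here is a minimal host-side sketch. It is an assumption-laden model and not the fix that was merged upstream: all sim_* names are hypothetical, and the critical section is modelled as a flag that defers a pending preemption until the section is exited.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NUM_CORES 2

/* Hypothetical stand-ins for the per-core suspend counters and task state. */
static unsigned sim_suspended[SIM_NUM_CORES];
static int  sim_core;             /* core the unpinned task runs on        */
static bool sim_in_critical;      /* true while inside the critical section */
static bool sim_preempt_pending;  /* a preemption arrived but was deferred  */

/* Higher-priority task suspends core 0's scheduler; our task moves to core 1. */
static void sim_do_preempt(void)
{
    sim_suspended[0]++;
    sim_core = 1;
}

/* A preemption request is deferred while the critical section is held. */
static void sim_request_preempt(void)
{
    if (sim_in_critical) {
        sim_preempt_pending = true;
    } else {
        sim_do_preempt();
    }
}

static void sim_enter_critical(void)
{
    sim_in_critical = true;
}

static void sim_exit_critical(void)
{
    sim_in_critical = false;
    if (sim_preempt_pending) {
        sim_preempt_pending = false;
        sim_do_preempt();
    }
}

/* Reads the core ID and the matching counter as one uninterrupted step. */
bool sim_guarded_check_passes(void)
{
    sim_suspended[0] = 0;
    sim_suspended[1] = 0;
    sim_core = 0;
    sim_in_critical = false;
    sim_preempt_pending = false;

    sim_enter_critical();
    int core_id = sim_core;
    assert(core_id >= 0 && core_id < SIM_NUM_CORES);
    sim_request_preempt();                  /* arrives mid-check, is deferred */
    bool ok = (sim_suspended[core_id] == 0);
    sim_exit_critical();                    /* deferred preemption runs here  */
    return ok;
}

/* Core the task ended up on after the most recent check. */
int sim_core_after_check(void)
{
    return sim_core;
}
```

In this model the preemption still happens, but only after the core ID and its counter have been read together, so the check sees a consistent pair.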
@MikitaMinau Thanks for analyzing this further. We will take a look and create applicable fixes for v4.3 and onward. CC @Dazza0
@mahavirj Thank you for fixing the issue. When do you think the fix could go into a release version?
@MikitaMinau Backport MRs have been created, and they have passed internal tests as well. The fix on release/v4.4 should appear with the next GH sync. For release/v4.3, however, it may take some time, as the branch is currently locked for the next patch release. In the interim, you may use the attached patch for release/v4.3: freertos: fix thread safety for checking scheduler state_v4.3.zip. Kindly let us know if you face any issues with this fix.
Environment
Problem Description
The assert in vTaskDelay() failed in the middle of the OTA FW image download. My code does not call vTaskSuspendAll() directly, but it calls vTaskDelay() quite often in different tasks with different priorities. The tasks are mostly unpinned; however, the OTA task is pinned to core 0. We have been developing with v4.0 for more than a year, but ran into this issue only after updating ESP-IDF to 4.3. Also, I have found a similar issue here: https://github.com/espressif/esp-idf/issues/4230.
Expected Behavior
OTA should not lead to an assert.
Actual Behavior
Assert fails during OTA FW update.
Steps to reproduce
The issue was reproduced only once during the OTA process, after 20-30 attempts.
Code to reproduce this issue
Unfortunately, I cannot share the code (it also requires custom HW and won't run on devkits). The OTA task looks pretty much like the one in esp-idf\examples\system\ota\native_ota_example.
Debug Logs
Other items if possible
sdkconfig.txt