espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.72k stars 7.3k forks source link

system crash when using VFS+FatFS upon SPI bus error (IDFGH-6614) #8256

Open atlascoder opened 2 years ago

atlascoder commented 2 years ago

Environment

Problem Description

I have a board with SD chip conneted to it vis SPI bus using the pins:

define PIN_NUM_MISO GPIO_NUM_16

define PIN_NUM_MOSI GPIO_NUM_4

define PIN_NUM_CLK GPIO_NUM_5

define PIN_NUM_CS GPIO_NUM_18

define PIN_NUM_POW GPIO_NUM_2

Using the SD is problematic, for example, SD card hangs after writing 512 files into directory and requires to repower it and format in order to use it again. I tested with other SD cards - problem is the same. I suspect an issue in FatFS component. I could accept this if there would be a workaround that allowed to remount and format the SD card somehou without system crash or restart - this is critical. But I have no luck to get this.

The problem is that after any failed FatFS operation - system crashes on the next one. It panics on assert in freertos/queue.c:

BaseType_t xQueueGenericSend( QueueHandle_t xQueue,
                              const void * const pvItemToQueue,
                              TickType_t xTicksToWait,
                              const BaseType_t xCopyPosition )
{
    BaseType_t xEntryTimeSet = pdFALSE, xYieldRequired;
    TimeOut_t xTimeOut;
    Queue_t * const pxQueue = xQueue;

    configASSERT( pxQueue ); // <<< monitor reports crash here
    configASSERT( !( ( pvItemToQueue == NULL ) && ( pxQueue->uxItemSize != ( UBaseType_t ) 0U ) ) ); // but is must be here, because pvItemToQueue == NULL
    configASSERT( !( ( xCopyPosition == queueOVERWRITE ) && ( pxQueue->uxLength != 1 ) ) );
    #if ( ( INCLUDE_xTaskGetSchedulerState == 1 ) || ( configUSE_TIMERS == 1 ) )
        {
            configASSERT( !( ( xTaskGetSchedulerState() == taskSCHEDULER_SUSPENDED ) && ( xTicksToWait != 0 ) ) );
        }
    #endif

The system is not amn isolated example - the problem with asserts appears in the system with several tasks with pretty intensive operations. But the issue with FatFS limit of 512 file was confirmed on an isolated FW example (I can share it if needed).

Expected Behavior

System should not crash after any FatFS failure, allowing the system to handle the error somehow.

Actual Behavior

System crashes after failing FatFS invocation - unpon next execution:

if(rename(name1,name2)) {
    remove(name1);
{

crashes on execution remove(name1) - it doesn't appear on an isolated example!

Steps to reproduce

Can't share publicly.

Code to reproduce this issue

Debug Logs

assert failed: 0x400866b1
0x400866b1: xQueueGenericSend at /Users/ant/DevTools/esp/esp-idf/components/freertos/queue.c:820 (discriminator 1)

Backtrace:0x40081622:0x3ffcfda00x40085fc9:0x3ffcfdc0 0x4008ae36:0x3ffcfde0 0x400866b1:0x3ffcfe20 0x400d8d82:0x3ffcfe60 0x400d72e9:0x3ffcfe80 0x400d850b:0x3ffcfea0 0x400d958a:0x3ffcff50 0x400d30fd:0x3ffcff90 0x400e5a2c:0x3ffcffb0 0x400d5e9a:0x3ffcffd0 0x400d61c8:0x3ffd0030 0x400d5c2a:0x3ffd0060 
0x40081622: panic_abort at /Users/ant/DevTools/esp/esp-idf/components/esp_system/panic.c:402

0x40085fc9: esp_system_abort at /Users/ant/DevTools/esp/esp-idf/components/esp_system/esp_system.c:121

0x4008ae36: __assert_func at /Users/ant/DevTools/esp/esp-idf/components/newlib/assert.c:47

0x400866b1: xQueueGenericSend at /Users/ant/DevTools/esp/esp-idf/components/freertos/queue.c:820 (discriminator 1)

0x400d8d82: ff_rel_grant at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/port/freertos/ffsystem.c:105

0x400d72e9: unlock_fs at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:921
 (inlined by) unlock_fs at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:915

0x400d850b: f_rename at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:5046

0x400d958a: vfs_fat_rename at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/vfs/vfs_fat.c:702

0x400d30fd: esp_vfs_rename at /Users/ant/DevTools/esp/esp-idf/components/vfs/vfs.c:653 (discriminator 3)

0x400e5a2c: rename at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/rename.c:60

0x400d5e9a: rotate_file at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger_file.c:237 (discriminator 6)

0x400d61c8: LoggerFile_saveBuffer at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger_file.c:132

0x400d5c2a: process_buffer at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger.c:282
 (inlined by) logger_task at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger.c:95

ELF file SHA256: edd9275748635b92

Entering gdb stub now.
$T0b#e6GNU gdb (crosstool-NG esp-2021r2) 9.2.90.20200913-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-host_apple-darwin12 --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/build/GX5v210620.elf...
Remote debugging using /dev/cu.usbserial-0001
0x40081625 in panic_abort (details=0x3ffcfde0 "assert failed: 0x400866b1") at /Users/ant/DevTools/esp/esp-idf/components/esp_system/panic.c:402
402         *((volatile int *) 0) = 0; // NOLINT(clang-analyzer-core.NullDereference) should be an invalid operation on targets
(gdb) bt
#0  0x40081625 in panic_abort (details=0x3ffcfde0 "assert failed: 0x400866b1") at /Users/ant/DevTools/esp/esp-idf/components/esp_system/panic.c:402
#1  0x40085fcc in esp_system_abort (details=0x3ffcfde0 "assert failed: 0x400866b1") at /Users/ant/DevTools/esp/esp-idf/components/esp_system/esp_system.c:121
#2  0x4008ae39 in __assert_func (file=<optimized out>, line=0, func=0x0, expr=0x0) at /Users/ant/DevTools/esp/esp-idf/components/newlib/assert.c:47
#3  0x400866b4 in xQueueGenericSend (xQueue=0x3ffd63f0, pvItemToQueue=0x0, xTicksToWait=<optimized out>, xCopyPosition=0) at /Users/ant/DevTools/esp/esp-idf/components/freertos/queue.c:820
#4  0x400d8d85 in ff_rel_grant (sobj=0x3ffd63f0) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/port/freertos/ffsystem.c:105
#5  0x400d72ec in unlock_fs (res=FR_OK, fs=0x3ffe49cc) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:921
#6  unlock_fs (fs=0x3ffe49cc, res=FR_OK) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:915
#7  0x400d850e in f_rename (path_old=<optimized out>, path_new=<optimized out>) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/src/ff.c:5046
#8  0x400d958d in vfs_fat_rename (ctx=0x3ffe49ac, src=<optimized out>, dst=<optimized out>) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs/vfs/vfs_fat.c:702
#9  0x400d3100 in esp_vfs_rename (r=<optimized out>, src=<optimized out>, dst=0x3ffcffe0 "/sdcard/000002BE.LOG") at /Users/ant/DevTools/esp/esp-idf/components/vfs/vfs.c:653
#10 0x400e5a2f in rename (old=0x3ffca4f8 <_filename> "/sdcard/000002BE.TMP", new=0x3ffcffe0 "/sdcard/000002BE.LOG")
    at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/rename.c:60
#11 0x400d5e9d in rotate_file () at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger_file.c:237
#12 0x400d61cb in LoggerFile_saveBuffer (
    buf=0x3ffd2120 "D (703569) MDM: Endl open found\nD (703570) LOG2F: saving buffer, filepos=7974, buflen=7940\n, heap=98412\nD (703578) spi_master: device5 locked the bus\nD (703582) spi_master: device5 release bus\nD (7035"..., len=7937) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger_file.c:132
#13 0x400d5c2d in process_buffer (buf=0x3ffb24e8 <_buffers>) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger.c:282
#14 logger_task (arg=<optimized out>) at /Users/ant/DevProjects/fleetsu/gx5s-poc-v2/components/logger/logger.c:95
(gdb) 

Other items if possible

igrr commented 2 years ago

Hi @atlascoder, thanks for reporting the issue!

Using the SD is problematic, for example, SD card hangs after writing 512 files into directory and requires to repower it and format in order to use it again.

Do you happen to be creating files in the root directory of the card? 512 sounds like the limit of the number of entries in the FAT root directory. https://github.com/espressif/esp-idf/blob/187e6ff66f653ac21fb6ab67e3cc8cc3a571b032/components/fatfs/src/ff.c#L5540 Although it's a bit odd that you would see this limitation — on FAT32 the root directory content is stored in the data region, so the limit should only apply to FAT16. And I assume that any SD card these days is too large for FAT16. (Edit: I stand corrected, a 2GB card could be formatted in FAT16) Perhaps you could make a dump of the filesystem from your card when the issue happens, and check the FAT structure with something like https://github.com/maxpat78/FATtools?

Regarding the crash in f_rename/unlock_fs functions, the assertion in FreeRTOS happens because the mutex pointer being passed to xQueueGenericSend is NULL. If you have a GDB Stub session, you can check whether this is the case — please look for s_fat_ctxs[0].fs->sobj. I see two possibilities how sobj could become NULL:

  1. FatFS is deinitialized while another operation (rotate_file in your case) was in progress. esp_vfs_fat_unregister_path function is not thread-safe, so if for any reason it is called while another FatFS operation is in progress, something like what you observe may happen.
  2. A memory corruption in some other code could overwrite sobj member of struct FATFS. This is possible since sobj is right after lfnbuf which is used by some string functions, so a buffer overflow can't be ruled out. You can try the following to check this: add some dummy entry in struct FATFS (in ff.h) right before sobj. For example, uint32_t unused[32];. If the issue stops being reproducible, then it means that out-of-bounds write was indeed the cause.

I've also noticed that the backtrace refers to fsu_fatfs component in your project directory - DevProjects/fleetsu/gx5s-poc-v2/components/fsu_fatfs. Is the issue reproducible with fatfs component in ESP-IDF? If not, could you describe the changes between your version of fatfs and the one in ESP-IDF? A diff file would be the best option.

atlascoder commented 2 years ago

Hi, @igrr , thank you very much for so quick and comprehensive response!

Regarding the 512 limit - you are right, this must be only root folder limit. Though, I checked this before and faced 512 limit as well, but now, I made another check and I can write more that 512 files to non-root folder.

I have tested on a card with that was perviousely formatted in exFAT but 512 limit was also active. When I format the SD card by internal FatFS routne - it becomes Fat16 format.

Thank you for the clarification about this limit!

Regarding the crash - your recommendations are very close to my thoughts! I suspected that there is some memory corruption because of some non thread safe routine. I have another task in parallel that operates with file intensively. But I am not sure at the moment that this is esp_vfs_fat_unregister_path, but I will check. As well as your brilliant advice how to check lfnbuf overflow.

I'll post results later.

Thank you very much!!

atlascoder commented 2 years ago

Ah, @igrr , one detail I left aside but in your perspective it may look important. Before the crashes I usually observe erroneous responses from SDSPI driver. I understant, this info is not enough for analysis, but how do you think: can SPI driver make such memory corruption? For example, in a case when SD card stops to reply SD requests.