espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.65k stars 7.29k forks source link

Race condition when CONFIG_SPI_FLASH_SHARE_SPI1_BUS is enabled (IDFGH-3546) #5496

Open devanlai opened 4 years ago

devanlai commented 4 years ago

Environment

Problem Description

When CONFIG_SPI_FLASH_SHARE_SPI1_BUS is enabled - whether on a WROOM module with only the main flash or on a WROVER module with main flash + PSRAM sharing SPI1 - I've observed a race condition where one of the threads waiting to access the main flash (for example, to read from NVS or to perform an OTA operation) can end up sleeping forever waiting to acquire the SPI1 bus lock. This persists even if no other threads are holding the SPI1 bus lock.

I've prepared a minimal test case that can be used to reproduce the issue on WROOM module on a DevKitC, though I originally encountered this issue on a custom PCBA that was using a WROVER module.

The following config options are changed from the defaults:

  1. CONFIG_SPI_FLASH_SHARE_SPI1_BUS=y - obviously needed to reproduce this issue
  2. CONFIG_FREERTOS_USE_TRACE_FACILITY=y and CONFIG_FREERTOS_USE_STATS_FORMATTING_FUNCTIONS=y - to allow for interactive task debugging from the serial port.

In the linked test case project, it spawns two threads that loop forever doing the following:

  1. Read a random 1k chunk from the nvs partition (chosen because it exists by default)
  2. Every 100 reads, output a log message to indicate the thread is still active.
  3. Update a global table with the last time the thread completed a read.
  4. Delay for a short, random amount of time

As an additional question, from reading the documentation on CONFIG_SPI_FLASH_SHARE_SPI1_BUS, I'm still uncertain if I even need to enable it for my application that is using a WROVER module with PSRAM. The PSRAM is a device that is shared on SPI1, but it's not an extra device that I'm accessing through the esp_flash APIs - do I need to enable CONFIG_SPI_FLASH_SHARE_SPI1_BUS for correctness even if I don't add an external flash device outside of the module?

Expected Behavior

The two threads should ideally both get chances to run and read from the main flash. It is possible for one thread to monopolize the CPU since they have equal priorities, but if that thread is terminated, then the other thread should have a chance to run and we should see that the other thread is able to read from flash successfully.

Actual Behavior

One thread is able to read from flash, while the other sleeps indefinitely. Using a shell command to signal the thread that is successfully reading to exit shows that the thread has exited, but the other thread continues to sleep in the blocked state, even though there is no-one to contest access to the SPI1 bus lock.

Halting the CPU with a debugger over JTAG shows that the blocked thread's stack trace is the following:

Thread #14 1073456156 (Name: test1) (Suspended : Container) 
    esp_crosscore_int_send() at crosscore_int.c:109 0x400814fa  
    esp_crosscore_int_send_yield() at crosscore_int.c:115 0x40081554    
    xQueueGenericReceive() at cpu_ll.h:39 0x40088480    
    dev_wait() at spi_bus_lock.c:681 0x40083ca2 
    spi_bus_lock_acquire_start() at spi_bus_lock.c:681 0x40083ca2   
    spi_start() at spi_flash_os_func_app.c:63 0x40083873    
    spi1_start() at spi_flash_os_func_app.c:80 0x40083888   
    spiflash_start() at esp_flash_api.c:77 0x40083e90   
    esp_flash_read() at esp_flash_api.c:544 0x4008376a  
    esp_partition_read() at partition.c:347 0x400d61e5  
    <...more frames...> 

The implementation in spi_bus_lock.c looks quite complicated, but it seems that it is getting stuck sleeping on waiting for the semaphore to be given, even after the semaphore has already been given and is available.

Steps to reproduce

  1. Build and flash the linked minimal test case to reproduce the issue
  2. Over the serial port, observe that only one of either test1 or test2 is printing "Still alive" messages after the first ten seconds or so.
  3. Input the status command via the serial port to confirm that the silent thread has not updated in a while
  4. Input the spiflash_test stop <n> command to stop the active thread - either 1 for test1 or 2 for `test2.
  5. Input the tasks command to confirm that the thread has exited
  6. Wait a little while then input the status command to confirm that the other thread still has not updated.
  7. Input the tasks command and note that the sleeping thread has still never run.
  8. Disable CONFIG_SPI_FLASH_SHARE_SPI1_BUS in the sdkconfig
  9. Rebuild and reflash the project
  10. Observe that both threads run happily.

Code to reproduce this issue

See the gist up here: https://gist.github.com/devanlai/4429ca670538c91df570e7bae951e6ac

The important code is in app_main.c, but all of the support files are also included.

Debug Logs

Sample serial log output (when manually entering commands to test the behavior):

esp32$ I (5917) test2: Still alive
tasks
Task Name   Status  Prio    HWM Task#         Time  Pcnt
Shell Task      READY   1   6724    12       17342  0%
IDLE1           READY   0   1108    7      8118089  49%
IDLE0           READY   0   1108    6      8059209  49%
test2           BLOCKD  5   2468    14       66704  0%
Tmr Svc         BLOCKD  1   1588    8           47  0%
test1           BLOCKD  5   2472    13       12355  0%
ipc0            BLOCKD  24  540 2        16149  0%
ipc1            BLOCKD  24  552 3        52568  0%
esp_timer       BLOCKD  22  3636    1           29  0%
I (10847) test2: Still alive
esp32$ status
Thread 1: last updated 9320 ms ago
Thread 2: last updated 30 ms ago
I (15477) test2: Still alive
esp32$ spiflash_test stop 2
I (19807) main: Notifying test2 to stop
I (19817) test2: Exiting after 397 successful reads
esp32$ tasks
Task Name   Status  Prio    HWM Task#         Time  Pcnt
Shell Task      READY   1   6388    12       34561  0%
IDLE0           READY   0   1108    6     28492796  49%
IDLE1           READY   0   1108    7     28552209  49%
Tmr Svc         BLOCKD  1   1588    8           47  0%
ipc0            BLOCKD  24  540 2        16149  0%
ipc1            BLOCKD  24  552 3       110903  0%
esp_timer       BLOCKD  22  3636    1           29  0%
test1           BLOCKD  5   2472    13       12355  0%
esp32$ status
Thread 1: last updated 30140 ms ago
Thread 2: last updated 13310 ms ago
esp32$ 
Alvin1Zhang commented 4 years ago

Thanks for the detailed report, we will look into.