espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.

[ESP32S3] Reading from a SPIFFS filesystem causes interrupt jitter (IDFGH-13215) #14152

Open higaski opened 2 months ago

higaski commented 2 months ago

General issue report

I have noticed that reading from a SPIFFS filesystem via the read or fread calls introduces noticeable jitter to interrupts, even if those interrupts have been given the highest user priority (level 3) and run from IRAM.

Given the architecture diagrams from the ESP32-S3 reference manual, can someone explain to me why this happens? (Attached diagrams: system_structure_and_address_mapping, cache_structure.)

/edit Here is an entire project which makes this behavior quite visible: https://github.com/higaski/esp_idf_issue_14152

or zipped esp_idf_issue_14152.zip

It is running on an ESP32-S3-DevKitC-1 and loads a ~300 kB PNG file from SPIFFS while also running a ~20 kHz timer with a pin toggle. As long as this read process is going on, there can be up to an additional 6 µs of jitter. If the file does not get read, the interrupt produces a perfectly stable square wave.
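
In essence, the timer side boils down to something like this (simplified sketch, not the exact code from the linked repo; the pin, resolution and alarm period are placeholders, and calling gpio_set_level from an IRAM ISR assumes CONFIG_GPIO_CTRL_FUNC_IN_IRAM is enabled):

// Simplified sketch of the jitter measurement: a ~20 kHz gptimer alarm
// toggles a GPIO from an IRAM ISR while a task reads the PNG from SPIFFS.
#include "driver/gpio.h"
#include "driver/gptimer.h"
#include "esp_attr.h"
#include "esp_err.h"

static constexpr gpio_num_t toggle_gpio{GPIO_NUM_4};  // placeholder pin

static bool IRAM_ATTR on_alarm(gptimer_handle_t,
                               gptimer_alarm_event_data_t const*,
                               void*) {
  static bool level{};
  gpio_set_level(toggle_gpio, level = !level);
  return false;  // no higher-priority task woken
}

void setup_toggle_timer() {
  ESP_ERROR_CHECK(gpio_set_direction(toggle_gpio, GPIO_MODE_OUTPUT));

  gptimer_handle_t timer;
  gptimer_config_t const config{.clk_src = GPTIMER_CLK_SRC_DEFAULT,
                                .direction = GPTIMER_COUNT_UP,
                                .resolution_hz = 1'000'000u};  // 1 tick = 1 µs
  ESP_ERROR_CHECK(gptimer_new_timer(&config, &timer));

  gptimer_event_callbacks_t const cbs{.on_alarm = on_alarm};
  ESP_ERROR_CHECK(gptimer_register_event_callbacks(timer, &cbs, nullptr));

  gptimer_alarm_config_t alarm_config{.alarm_count = 50u};  // 50 µs -> ~20 kHz
  alarm_config.flags.auto_reload_on_alarm = true;
  ESP_ERROR_CHECK(gptimer_set_alarm_action(timer, &alarm_config));
  ESP_ERROR_CHECK(gptimer_enable(timer));
  ESP_ERROR_CHECK(gptimer_start(timer));
}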

igrr commented 1 month ago

This is probably related to the fact that the code involved in performing flash operations occasionally uses critical sections (portENTER/EXIT_CRITICAL), which disable interrupts up to level 3. The code inside those critical sections is relatively short, but it is still enough to cause a few microseconds of interrupt jitter.
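
Roughly, the pattern looks like this (a minimal sketch, not the actual flash driver code; the lock and function names are made up):

// A spinlock-based critical section masks all interrupts up to priority
// level 3 on the current core for its duration; a pending level-3 timer
// ISR is only serviced after portEXIT_CRITICAL, which shows up as jitter.
#include "freertos/FreeRTOS.h"

static portMUX_TYPE s_flash_op_lock = portMUX_INITIALIZER_UNLOCKED;

void flash_op_bookkeeping() {                // illustrative name
  portENTER_CRITICAL(&s_flash_op_lock);      // raises the interrupt mask to level 3
  // ... a short piece of flash-related bookkeeping runs here ...
  portEXIT_CRITICAL(&s_flash_op_lock);       // mask restored, held-off ISRs run now
}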

higaski commented 1 month ago

This is probably related to the fact that the code involved in performing flash operations occasionally uses critical sections (portENTER/EXIT_CRITICAL), which disable interrupts up to level 3. The code inside those critical sections is relatively short, but it is still enough to cause a few microseconds of interrupt jitter.

Any chance we could get a user-level interrupt which does not get disabled?

Technically there should be no reason for my timer interrupt to get disabled, yet I'm paying the price.

igrr commented 1 month ago

If the real ISR in your application is as simple as in the example project and doesn't call any RTOS functions, you can try to rewrite it in assembly and register it as a high-priority interrupt (level 4). These interrupts do not get disabled in RTOS critical sections and could be used for simple tasks, like toggling a GPIO.

You could also try making the ISR registration call from CPU 1, in which case the interrupt will be serviced there. This won't completely eliminate the jitter since RTOS critical sections are still occasionally used on CPU 1, but it might reduce it to an acceptable level.
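
A rough sketch of that second option (names and numbers are illustrative; it assumes the gptimer driver installs its interrupt on the core that makes the registration call, and setup_toggle_timer() is a placeholder for whatever creates, registers and starts your timer):

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

void setup_toggle_timer();  // placeholder: creates the gptimer, registers the
                            // IRAM callback and starts the alarm

static void timer_init_task(void*) {
  setup_toggle_timer();  // interrupt gets allocated on this core (core 1)
  vTaskDelete(nullptr);  // one-shot init task
}

extern "C" void app_main() {
  // Pin the init task to core 1 so the registration call runs there.
  xTaskCreatePinnedToCore(timer_init_task, "timer_init", 4096u, nullptr, 5u,
                          nullptr, 1);
}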

higaski commented 1 month ago

Sadly it's not that simple, no. My real application calls functions from the GPTimer and UART drivers and sends a task notification...

// RMT interrupt, start timer
bool IRAM_ATTR rmt_callback(rmt_channel_handle_t,
                            rmt_tx_done_event_data_t const*,
                            void*) {
  gptimer_set_raw_count(gptimer, 0ull);
  gptimer_alarm_config_t const alarm_config{
    .alarm_count = static_cast<decltype(gptimer_alarm_config_t::alarm_count)>(
      TCSMin + offsets.tcs)};
  gptimer_set_alarm_action(gptimer, &alarm_config);
  return pdFALSE;
}

// Timer interrupt, disable outputs, read available UART bytes
bool IRAM_ATTR gptimer_callback(gptimer_handle_t timer,
                                gptimer_alarm_event_data_t const* edata,
                                void*) {
  static bool ch1{};
  bool const ch2{true};
  BaseType_t high_task_awoken{pdFALSE};

  // TCS
  if (edata->alarm_value < TTS1) {
    gptimer_set_raw_count(timer, TCSMin);

    // Pull tracks low
    gpio_set_level(bidi_en_gpio_num, 1u);

    // Reset alarm to TS2
    gptimer_alarm_config_t const alarm_config{.alarm_count = TTC1};
    gptimer_set_alarm_action(timer, &alarm_config);
  }
  // TS2
  else if (edata->alarm_value < TTS2) {
    // Reset alarm to TCE
    gptimer_alarm_config_t const alarm_config{.alarm_count = TCEMin};
    gptimer_set_alarm_action(timer, &alarm_config);

    // Check whether there has been data in channel 1
    ch1 = uart_ll_get_rxfifo_len(&UART1);
  }
  // TCE
  else {
    // Release tracks
    gpio_set_level(bidi_en_gpio_num, 0u);

    // Send a notification to the task
    xTaskNotifyIndexedFromISR(task.handle,
                              default_notify_index,
                              (ch2 << 1u) | ch1,
                              eSetValueWithOverwrite,
                              &high_task_awoken);

    gptimer_set_alarm_action(gptimer, NULL);
  }

  return high_task_awoken == pdTRUE;
}

/edit1 Another way to work around this issue would be to give users the ability to specify the priority level which should be disabled during flash reads, at least a choice between >3 and >=3. Typically users will already have specified that their interrupts and peripheral drivers run from IRAM anyway.

/edit2 Oh, and I already run all my application code on core 1. Core 0 is solely used by WiFi (HTTP/UDP).

igrr commented 1 month ago

Could you try enabling CONFIG_SPI_FLASH_AUTO_SUSPEND and see if that improves the situation? It depends on the flash chip supporting auto-suspend, but if it does, the flash reading process is simplified and there will probably be fewer critical sections involved.
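
For reference, the option can be enabled in the project configuration (menuconfig), or e.g. in sdkconfig.defaults:

CONFIG_SPI_FLASH_AUTO_SUSPEND=y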

Another way to work around this issue would be to give users the ability to specify the priority level which should be disabled during flash reads, at least a choice between >3 and >=3.

I'm afraid there could be some risks with such a change. If the developer changes this setting during development and forgets about it, they could later run into race conditions which are hard to debug. It's not easy to check automatically that the ISR is not accessing anything that is not protected by a critical section with a matching interrupt masking level. Besides, currently all our critical sections disable interrupts up to the same level, so we would need to introduce a new critical section API with a configurable level. We'll discuss this internally and see what could be done.

higaski commented 1 month ago

Could you try enabling CONFIG_SPI_FLASH_AUTO_SUSPEND and see if that improves the situation? It depends on the flash chip supporting auto-suspend, but if it does, the flash reading process is simplified and there will probably be fewer critical sections involved.

CONFIG_SPI_FLASH_AUTO_SUSPEND doesn't seem to be supported on the WROOM2 I'm using: E (1153) chip_generic: The flash you use doesn't support auto suspend, only 'XMC' is supported

Another way to work around this issue would be to give users the ability to specify the priority level which should be disabled during flash reads, at least a choice between >3 and >=3.

I'm afraid there could be some risks with such a change. If the developer changes this setting during development and forgets about it, they could later run into race conditions which are hard to debug. It's not easy to check automatically that the ISR is not accessing anything that is not protected by a critical section with a matching interrupt masking level. Besides, currently all our critical sections disable interrupts up to the same level, so we would need to introduce a new critical section API with a configurable level. We'll discuss this internally and see what could be done.

I can see how such a change might be quite impactful to the codebase, but I disagree that it is harder to debug than any of the peculiarities which already exist anyhow. My favorite example is still the single constexpr variable which gets accessed inside an IRAM function and causes some kind of cache panic... I think I've opened multiple issues on such cases over the past few months, some of which led to bug fixes.

In my opinion, such a feature could help the ESP32 shed its image of not being "real-time" capable, at least a little. So far, whenever I got asked about interrupt jitter, I told people that they hadn't set up their cores and peripherals correctly yet, which admittedly is very hard. But at least it's possible. Getting a peripheral's interrupt disabled by a critical section which technically shouldn't be necessary is something I currently can't change.

igrr commented 1 month ago

that it is harder to debug than any of the peculiarities which already exist anyhow. My favorite example is still the single constexpr variable which gets accessed inside an IRAM function and causes some kind of cache panic

If an issue leads to a panic, and the panic points to the function which needs to be changed, this is more straightforward to debug than if you accidentally corrupt RTOS data structures by not locking the interrupts and then (maybe indirectly) calling an RTOS function from an interrupt handler. In the latter case, the crash will not directly point to the problematic function, and will most likely result in a bug report similar to "IDF FreeRTOS crashes" — which IDF developers will then need to reproduce and debug. (You do provide us with nice reproducible examples which makes this easier, but the unfortunate fact is that not all customers do the same!)

I do agree that having the ISRs less affected by the system would be a nice improvement; we'll discuss this and see what can be done.

higaski commented 1 month ago

Fair point. I'm looking forward to your reply.

sudeep-mohanty commented 4 weeks ago

Hi @higaski, Thank you for providing a very useful example to reproduce the problem. From our initial investigation, it looks like most of the jitter is caused by the fopen() and fclose() calls during the SPIFFS read operation. For a test, if your application allows it, would it be possible for you to maintain the file descriptor outside of your read loop so that you don't have to open/close it each time? Could you then assess its impact on the interrupt jitter you observe in your application? Thanks!

higaski commented 4 weeks ago

Hi @higaski, Thank you for providing a very useful example to reproduce the problem. From our initial investigation, it looks like most of the jitter is caused by the fopen() and fclose() calls during the SPIFFS read operation. For a test, if your application allows it, would it be possible for you to maintain the file descriptor outside of your read loop so that you don't have to open/close it each time? Could you then assess its impact on the interrupt jitter you observe in your application? Thanks!

My application would require more than just a single file though. I'm not terribly familiar with C file handles, but I assume you'd need one per file.

I could maybe gzip the whole web app I'm actually transferring (currently dozens of files), but that is still more of a workaround than a solution.

sudeep-mohanty commented 4 weeks ago

@higaski What I meant was that you could open the file once with fopen(), perform all your operations on the file, and then close it with one fclose(). Irrespective of the number of file descriptors you need, opening and closing files multiple times can result in increased interrupt jitter. I'd like to know if reducing the number of fopen() and fclose() operations indeed reduces the jitter you observe. That could help us narrow down an optimal solution.

higaski commented 4 weeks ago

@higaski What I meant was that you could open the file once with fopen(), perform all your operations on the file, and then close it with one fclose(). Irrespective of the number of file descriptors you need, opening and closing files multiple times can result in increased interrupt jitter. I'd like to know if reducing the number of fopen() and fclose() operations indeed reduces the jitter you observe. That could help us narrow down an optimal solution.

I've changed my example to use a global file descriptor which opens the file just once and never closes it again:

// Read .png file from SPIFFS
// fd is a global FILE* (initialized to NULL elsewhere); the file is opened
// once on the first call and never closed again.
void read_file() {
  static std::array<uint8_t, 1024uz> file_buffer;
  if (fd == NULL) {
    fd = fopen("/logo.png", "r");
    if (fd == NULL) {
      ESP_LOGE(TAG, "Failed to open file for reading");
      return;
    }
  }
  rewind(fd);
  ssize_t chunksize;
  do {
    chunksize = read(fileno(fd), data(file_buffer), size(file_buffer));
  } while (chunksize > 0);  // stop at EOF (0) or error (-1)
}

Estimated jitter looks about the same as before.

sudeep-mohanty commented 4 weeks ago

Thanks for trying it out, @higaski. We shall continue the investigation and keep you posted.