espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

Why does SPIRAM_RODATA affect PSRAM -> PSRAM performance so much? (IDFGH-13750) #14612

Open SimonKagstrom opened 1 week ago

SimonKagstrom commented 1 week ago

General issue report

esp-idf version: 5.3

I'm having trouble understanding the performance implications of flash vs PSRAM. In particular, the SPIRAM_RODATA (Move Read-Only Data in Flash to PSRAM) option improves performance a lot, but I can't understand why. My application looks like this:

With a smaller map, I'm able to test with the SPIRAM_RODATA option, but with the full map, there's not enough space.

Observations:

I can't really understand the difference though: while flash accesses are slower, they should really only involve the compressed PNG data, which is only read when needed. Most of the time, there is really only a PSRAM -> PSRAM copy.

Can someone explain why this difference shows up? Is there something else stored in .rodata that affects performance this much? Or is it a cache effect?

Moving instructions from flash to PSRAM also helps, but the effect is not that pronounced.

Icarus113 commented 1 week ago

May I know which ESP module you are using? Is it one with octal PSRAM and quad flash? And how do you configure the flash and PSRAM? Maybe you can post the sdkconfig as well.

SimonKagstrom commented 1 week ago

Hi! It's an esp32s3, an Adafruit Qualia ESP32-S3, so 8MiB of octal PSRAM.

Here's the sdkconfig:

sdkconfig.txt

Icarus113 commented 1 week ago

It is set to 80 MHz PSRAM and 40 MHz flash; the PSRAM is octal and the flash is quad. So the PSRAM speed is much higher than the flash speed.

And what does your app do with this part?

PNG images are stored in .rodata, forming tiles in a map. For the full map there's ~12MB of them.

On ESP32S3, Flash and PSRAM are sharing the same bus (SPI0), there is an arbitrator between flash and psram. So if the accesses to flash (slower) are re-directed to psram (faster), the performance could be better.
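To put rough numbers on the speed gap described above, here is a minimal sketch of the peak raw transfer rate of each interface. It is only back-of-the-envelope arithmetic: the function name `peak_mb_per_s` is made up for illustration, and whether the PSRAM runs at single or double data rate is not stated in this thread, so `edges_per_clock` is an assumption the reader has to fill in.

```c
#include <assert.h>
#include <stdint.h>

/* Peak raw transfer rate in MB/s for an SPI-style memory interface:
 *   clock (MHz) x data lines x data edges per clock / 8 bits per byte.
 * Command/address phases, cache-line fills and bus arbitration are ignored,
 * so real throughput is lower than this figure. */
uint32_t peak_mb_per_s(uint32_t clock_mhz, uint32_t data_lines,
                       uint32_t edges_per_clock)
{
    assert(data_lines > 0 && edges_per_clock > 0);
    return clock_mhz * data_lines * edges_per_clock / 8u;
}
```

With the settings from the sdkconfig, `peak_mb_per_s(40, 4, 1)` gives 20 MB/s for the quad flash, while `peak_mb_per_s(80, 8, 1)` gives 80 MB/s for the octal PSRAM (160 MB/s if it runs at double data rate). Under these assumptions, every byte read from flash holds the shared bus at least four times longer than the same byte read from PSRAM.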

SimonKagstrom commented 1 week ago

Thanks a lot for the investigation!

And what does your app do with this part?

PNG images are stored in .rodata, forming tiles in a map. For the full map there's ~12MB of them.

When drawing the display, the process is basically

The UI code is here:

https://github.com/SimonKagstrom/maelir/blob/main/src/ui/ui.cc#L60

and the display code is here:

https://github.com/SimonKagstrom/maelir/blob/main/target/target_display/target_display.cc#L611

On ESP32S3, Flash and PSRAM are sharing the same bus (SPI0), there is an arbitrator between flash and psram. So if the accesses to flash (slower) are re-directed to psram (faster), the performance could be better.

Yes, thanks for the explanation. I understand the PSRAM accesses are faster, but unless I miss something, I think the display code should really be PSRAM -> PSRAM anyway.

Or is the issue with how the display is refreshed with the "current" PSRAM buffer perhaps?

The display controller is this: https://cdn-shop.adafruit.com/product-files/5793/NV3052C-Datasheet-V0.2.pdf

Icarus113 commented 1 week ago

Unpack all needed tiles from PNG format on the flash, to RGB565 data in PSRAM

This means you will

You mentioned the ping-pong frame buffer. Does this mean that while ping-pong buffer_x is being transmitted via DMA (Flow C), Flow A could be running as well?

Flow A accesses the flash, so it occupies the SPI0 bus that Flow C, the display code you care about, also needs.

If .rodata is moved to PSRAM, Flow A should be faster, which results in better performance for the display code.

SimonKagstrom commented 1 week ago

Thanks for your insights!

I'll look further into it tonight. In general, Flow A -> Flow B -> Flow C should be sequential (i.e., tiles are requested by the UI). However, I designed it as two threads to facilitate some yet-to-be-implemented prefetching. So Flow A should never run while flow B / flow C is running, but I'll make sure that's really the case.

Is it possible through using e.g., __attribute__((section(...))) to copy parts of the .rodata section to PSRAM, but not the rest? E.g., avoid copying the PNG images, but let the rest of .rodata be copied to PSRAM.
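One way to check the claim that Flow A never runs while Flow B / Flow C is running would be a small overlap counter. The sketch below is hypothetical instrumentation, not code from the maelir repository: all names are invented, and it only observes software sections that are explicitly bracketed, so DMA traffic started by hardware is not covered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical instrumentation (names are illustrative only). */
static atomic_int display_active;   /* > 0 while Flow B or Flow C runs   */
static atomic_uint overlap_count;   /* decodes that began during B or C  */

/* Bracket the frame copy / flush code with these two calls. */
void display_section_enter(void)
{
    atomic_fetch_add(&display_active, 1);
}

void display_section_leave(void)
{
    int prev = atomic_fetch_sub(&display_active, 1);
    assert(prev > 0);               /* catches an unbalanced leave */
    (void)prev;
}

/* Call at the start of each tile decode (Flow A).
 * Returns true if a display section was active at that moment. */
bool decode_section_check(void)
{
    if (atomic_load(&display_active) > 0) {
        atomic_fetch_add(&overlap_count, 1);
        return true;
    }
    return false;
}

unsigned overlap_total(void)
{
    return atomic_load(&overlap_count);
}
```

If `overlap_total()` stays at zero over a long run, tile decodes never started while a bracketed display section was active. The check is one-directional: a display section that begins in the middle of a decode is not counted, so a symmetric counter would be needed to rule that case out as well.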

SimonKagstrom commented 1 week ago

Having tested a bit more tonight, @Icarus113, I still see the issue. What I did was the following:

This should ensure the UI thread always runs without interruption, and that also only a static image is shown on the screen.

However, the issue still shows up with this configuration, i.e., with a 16 MHz pixel clock, I get a garbled display.

SimonKagstrom commented 6 days ago

More experiments:

The first change improves things a bit, but the display still can't use a pixel clock higher than 6 MHz. It draws stably though, so it looks nice.

With the bounce buffer, I can use 16MHz, and (display) performance seems much better. However, I see some strange flickering on the display, which I don't quite understand. I still use two frame buffers, and only draw on the inactive one before flipping.

So I still think it behaves strangely, but I guess I can live with the stable display of the first version, although it won't be quite as fluid as I had hoped beforehand.
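For a sense of scale, the pixel clocks mentioned in this thread can be converted into the memory bandwidth that display refresh alone consumes. This is a rough sketch that assumes one pixel is sent per pixel clock and that the frame buffer holds RGB565 data (2 bytes per pixel), as in the tile-unpacking step quoted earlier; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second that must be fetched from the frame buffer to sustain
 * a given pixel clock, assuming one pixel per clock (an assumption). */
uint32_t scanout_bytes_per_s(uint32_t pclk_hz, uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return pclk_hz * bytes_per_pixel;
}
```

Under these assumptions, `scanout_bytes_per_s(16000000, 2)` is 32,000,000 bytes/s and `scanout_bytes_per_s(6000000, 2)` is 12,000,000 bytes/s. The refresh stream is continuous, so at 16 MHz it needs about 32 MB/s from PSRAM before any drawing, frame copying, or flash traffic on the same bus is counted.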