Open akx opened 1 month ago
Hmm I don't see anything particularly wrong in the code, but AFAICS you should be able to push roughly 60fps with 80 MHz SPI. Of course in practice that's going to be a bit lower (and I only did my napkin math real quick, so I might've made a mistake here).
For one full screen of filler you're doing 320x240x2x8 bits, so 1_228_800 bits to be pushed over at 80 MHz; that should fit roughly 65 of those a second.
One thing I do not see though in your code is setting the MCU's processing speed/clocks. I don't know how ESP32 inits but I remember that on my hifive1 I had to set my clock to full. In my case it's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite lower IIRC. It might be you're just running on lower clocks on your MCU.
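The napkin math above can be written out as a small check. This is only a sketch of the arithmetic: the function names are made up for illustration, and it ignores command bytes, chip-select toggling and any gaps between transfers, so it gives an upper bound rather than an expected frame rate.

```rust
/// Bits needed to send one full frame.
fn bits_per_frame(width: u64, height: u64, bytes_per_pixel: u64) -> u64 {
    width * height * bytes_per_pixel * 8
}

/// Upper bound on frames per second if the SPI bus never idles.
fn max_fps(spi_hz: u64, frame_bits: u64) -> f64 {
    spi_hz as f64 / frame_bits as f64
}

fn main() {
    // 320x240 display, 2 bytes per pixel (RGB565), 80 MHz SPI clock.
    let bits = bits_per_frame(320, 240, 2);
    println!("{} bits per frame", bits);
    println!("{:.1} fps upper bound", max_fps(80_000_000, bits));
}
```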
The `display-interface` crate and `embedded-hal` 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new `SpiDevice` trait, `display-interface-spi` buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in `display-interface`, the performance suffers noticeably for fast displays.
Hmm that's quite unfortunate. Do you think we should abandon `display-interface` use? I'm also curious why 64 byte chunks are necessary? I don't remember that being any kind of requirement when we discussed the e-h 1.0 SPI traits.
Thank you for the response and thoughts.
> One thing I do not see though in your code is setting the MCU's processing speed/clocks. I don't know how ESP32 inits but I remember that on my hifive1 I had to set my clock to full. In my case it's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite lower IIRC. It might be you're just running on lower clocks on your MCU.

The full log (which of course I should've posted in the first place, but here) says
I (220) cpu_start: Pro cpu start user code
I (221) cpu_start: cpu freq: 160000000 Hz
so unless something in the ESP IDF drops that down after the app starts, I think I should be running at 160MHz.
> I'm also curious why 64 byte chunks are necessary?

That seems to be this bit of code.
Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.
EDIT: Changing that buffer size in a local copy of `display-interface-spi`:

| buffer size | result |
|---|---|
| 64 | FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us |
| 128 | FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us |
| 256 | FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us |
| 512 | FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us |
| 1024 | FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us |
| 3840 | Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄 |
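For readers who haven't looked at that code, the pattern being tuned here is roughly the following. This is a simplified stand-in, not the actual `display-interface-spi` source: `send_chunked` and its closure are hypothetical names, and a `Vec` replaces the fixed-size stack buffer (which is presumably what overflowed in the 3840 case).

```rust
/// Pulls 16-bit pixels from an iterator, converts each to big-endian bytes,
/// and hands them to `write` in chunks of `chunk_words` pixels.
/// Returns how many times `write` was called.
fn send_chunked<I, F>(pixels: I, chunk_words: usize, mut write: F) -> usize
where
    I: IntoIterator<Item = u16>,
    F: FnMut(&[u8]),
{
    let mut buf: Vec<u8> = Vec::with_capacity(chunk_words * 2);
    let mut writes = 0;
    for px in pixels {
        // The endian flip: native u16 -> two big-endian bytes.
        buf.extend_from_slice(&px.to_be_bytes());
        if buf.len() == chunk_words * 2 {
            write(&buf);
            writes += 1;
            buf.clear();
        }
    }
    // Flush a trailing partial chunk, if any.
    if !buf.is_empty() {
        write(&buf);
        writes += 1;
    }
    writes
}

fn main() {
    // One 320x240 frame of a single color, 32 pixels (64 bytes) per chunk.
    let frame = std::iter::repeat(0xF800u16).take(320 * 240);
    let mut sent = 0usize;
    let writes = send_chunked(frame, 32, |bytes| sent += bytes.len());
    println!("{} writes, {} bytes", writes, sent);
}
```

Each call to `write` corresponds to one trip through the HAL, which is why the chunk size matters at all.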
Perfect, I was hoping to see how this goes. It seems the flip-flopping of the buffer isn't the main slowdown here. Making the buffer bigger improves it, but not in a linear fashion at all: the 1024 case should've been 16 times better in theory, and blit time only improves by about 12%.
Could you try changing `fill_contiguous` into a `fill_solid`, or possibly calling `set_pixels` directly? I just want to make sure it's not something funky.
I did some testing and I think the ESP HAL has quite a bit of overhead for the kind of transfers we use. I'm currently testing on an ESP32-C3 with the no_std HAL. With some tweaks I was able to get the time to clear the framebuffer down to ~60ms, which still isn't great.
Using a logic analyzer did reveal some interesting patterns:
For this capture I've used a `display-interface-spi` buffer size of 256 bytes and a SPI clock of 40 MHz (I did set 60 MHz in my program, but I measured 40 MHz). The HAL splits up every 256 byte buffer into four 64 byte blocks, because that is the SPI FIFO size (https://docs.esp-rs.org/esp-hal/esp-hal/0.21.1/esp32c3/esp_hal/spi/master/struct.Spi.html#method.write_bytes). While the 64 byte sub blocks are transferred at full SPI clock speed without any gaps between bytes, the gaps between the 64 byte blocks cause a significant overhead. This alone causes the write performance to drop by ~33%.
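To put a rough number on that dead time: a 64 byte block is 512 bits, so at the measured 40 MHz it occupies the bus for 12.8 µs. The sketch below turns a block size, clock and gap into a bus utilization figure. Note that the 6.4 µs gap is not a measurement from the capture; it is an assumed value, back-computed from the ~33% drop, and `bus_utilization` is just an illustrative helper.

```rust
/// Fraction of time the SPI bus is clocking data when every block of
/// `block_bytes` is followed by an idle gap of `gap_us` microseconds.
fn bus_utilization(block_bytes: u32, spi_mhz: f64, gap_us: f64) -> f64 {
    // At N MHz the bus moves N bits per microsecond.
    let busy_us = f64::from(block_bytes) * 8.0 / spi_mhz;
    busy_us / (busy_us + gap_us)
}

fn main() {
    // 64 byte FIFO blocks at 40 MHz with an assumed 6.4 us gap between them.
    let util = bus_utilization(64, 40.0, 6.4);
    println!("bus utilization: {:.0}%", util * 100.0);
}
```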
I'm not familiar enough with the ESP32-C3 internals to know if this dead time could be reduced. The "best" solution would probably be a framebuffer in RAM, which is then transferred via DMA to the display, but that does require a lot of RAM and there is currently no support for DMA in this crate and the embedded-graphics ecosystem in general.
> Could you try changing `fill_contiguous` into a `fill_solid`, or possibly calling `set_pixels` directly? I just want to make sure it's not something funky.

I changed the draw bit to
display.fill_solid(&fullscreen, if i % 2 == 0 { Rgb565::BLACK } else { Rgb565::WHITE }).unwrap();
I (1347) telsu: FPS: 10.10 / render=125us clone=4us blit=99427us frame=99560us
I (1947) telsu: FPS: 10.10 / render=121us clone=4us blit=99424us frame=99552us
I (2547) telsu: FPS: 10.10 / render=125us clone=4us blit=99425us frame=99558us
I (3147) telsu: FPS: 10.10 / render=124us clone=4us blit=99426us frame=99558us
which is still kind of unfortunate (and funky, I think?)...
For what it's worth, changing `ST7789::write_pixels` to use `U16LEIter` instead of `U16BEIter` (i.e. so that the buffer doesn't need to be byte-swapped, though naturally the data sent to the display is then pretty bogus unless treated elsewhere) only changes blit=100382us -> blit=98454us for the original test case.
I've worked on this a bit more and got the update reasonably fast by using DMA and bypassing the iterator. My implementation uses two DMA buffers, which makes it possible to fill one while the other is being transmitted. There are still gaps between the individual blocks, but the overall impact of this overhead is now fairly low.
This really shows that using a dynamically dispatched iterator to transfer lots of data is a bad idea. Even with two DMA buffers the time it takes to fill one buffer up is much longer than the time it takes to transmit a buffer. This causes the SPI bus to be idle about 80% of the time, which is why it still takes 70ms to update the display at an 80 MHz SPI clock.
But if we bypass the iterator and use `DataFormat::U8` to transmit the data to the display, the transfer is much faster. Filling up the new buffer now only takes a fraction of the time it takes to transmit one buffer. With 95% SPI bus utilization the display now gets updated in ~16ms.
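The two-buffer scheme described above can be sketched on a desktop with plain `std`. This is not the code from the gist: a spawned thread stands in for the DMA engine, channels stand in for the "transfer done" signalling, and `double_buffered_send` is a hypothetical name. It only illustrates the ping-pong idea, where one buffer is refilled while the other is in flight.

```rust
use std::sync::mpsc;
use std::thread;

/// Streams `data` through exactly two reusable buffers of `buf_len` bytes.
/// Returns what the simulated bus received. Panics if `buf_len` is 0.
fn double_buffered_send(data: &[u8], buf_len: usize) -> Vec<u8> {
    let (full_tx, full_rx) = mpsc::channel::<Vec<u8>>(); // filled buffers -> "DMA"
    let (free_tx, free_rx) = mpsc::channel::<Vec<u8>>(); // drained buffers -> filler

    // Only two buffers ever exist; they circulate between both sides.
    for _ in 0..2 {
        free_tx.send(Vec::with_capacity(buf_len)).unwrap();
    }

    // Stand-in for the DMA engine: "transmits" a buffer, then hands it back.
    let dma = thread::spawn(move || {
        let mut wire = Vec::new();
        for buf in full_rx {
            wire.extend_from_slice(&buf);
            let _ = free_tx.send(buf);
        }
        wire
    });

    // Filler side: blocks only when both buffers are still in flight.
    for chunk in data.chunks(buf_len) {
        let mut buf = free_rx.recv().unwrap();
        buf.clear();
        buf.extend_from_slice(chunk);
        full_tx.send(buf).unwrap();
    }
    drop(full_tx); // signal end of frame
    dma.join().unwrap()
}

fn main() {
    let frame = vec![0xAAu8; 320 * 240 * 2];
    let wire = double_buffered_send(&frame, 4096);
    assert_eq!(wire, frame);
    println!("sent {} bytes", wire.len());
}
```

If filling a buffer takes longer than transmitting one, the "DMA" side spends most of its time waiting, which matches the idle bus seen with the iterator path.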
I've uploaded my custom `display-interface` in this GIST. The `draw_fast` method is a bit of a hack, but has the largest impact on performance.
@rfuest Nice work! Am I understanding correctly that since the `Framebuffer` is marked `BigEndian`, the price for converting from native (little-endian) pixel formats to big-endian is paid when drawing to the buffer instead of when transferring, and since `draw_fast` just does a "trust me bro, these bytes are what we want to push out to the bus", that works out? 😄
Yes, that's correct. Some API changes will be necessary to make this safer to use without the chance of accidentally using the wrong endianness.
In my code the CPU is often just waiting for the DMA transfer to finish and converting a little-endian framebuffer into a big-endian format during that time wouldn't negatively impact the time it takes to update the display. But the `display-interface` crate doesn't provide a way to specify the endianness of a `&[u8]` buffer. This hack works and is as fast as before, but it is on a whole other level of "trust me bro".
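The "pay the endianness cost at draw time" idea from this exchange can be sketched as follows. `BeFramebuffer` is a hypothetical type for illustration, not the embedded-graphics `Framebuffer`; the point is only that the byte slice it exposes is already in the order the display expects, so it can be handed to a raw `&[u8]` transfer as-is.

```rust
/// RGB565 framebuffer that stores every pixel big-endian.
struct BeFramebuffer {
    width: usize,
    bytes: Vec<u8>,
}

impl BeFramebuffer {
    fn new(width: usize, height: usize) -> Self {
        Self { width, bytes: vec![0; width * height * 2] }
    }

    /// The byte swap happens here, once per drawn pixel.
    fn set_pixel(&mut self, x: usize, y: usize, rgb565: u16) {
        let i = (y * self.width + x) * 2;
        self.bytes[i..i + 2].copy_from_slice(&rgb565.to_be_bytes());
    }

    /// Bytes ready for the bus; no conversion needed at transfer time.
    fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }
}

fn main() {
    let mut fb = BeFramebuffer::new(2, 1);
    fb.set_pixel(1, 0, 0xF800); // pure red in RGB565
    println!("{:02X?}", fb.as_bytes());
}
```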
So from what I understand there are a few issues here:

- `display-interface` forcing 64 bit buffer size
- `display-interface` forcing endianness runtime conversion
- `display-interface` using dynamic dispatch

Optionally I think another issue is that:

- `embedded-hal` not providing a native iterator SPI `write/transfer/read` methods

I think if `embedded-hal` actually had iterator support, and if we figured out a better abstraction for `display-interface` we should be able to get a fairly quick write using an iterator with 0 additional buffering?
> `display-interface` forcing 64 bit buffer size

`display-interface` doesn't force any buffer size, but the implementation for SPI in `display-interface-spi` does use a 64 word buffer to convert the iterator based `DataFormat`s into a buffer of bytes that can be passed to the `SpiDevice` impl.
> `display-interface` forcing endianness runtime conversion

No, it doesn't force runtime conversion, but it only supports different endianness with `u16` buffers via the `DataFormat::{U16BE, U16LE}` variants. In my opinion it is inconvenient that you need to cast a more common `&[u8]` buffer to `&[u16]` first, which to my knowledge requires an external crate like `byte-slice-cast` (which the `display-interface` impls use internally).
> `display-interface` using dynamic dispatch

This has been an issue for a long time and there is even an open draft PR from 2021 with a possible solution in the `display-interface` repo, but nobody seems to have cared about that: https://github.com/therealprof/display-interface/pull/19
I'm not sure if we (or someone) should fix `display-interface`, which seems to be more or less unmaintained, or if it is better to create a new alternative. In any case this will need some careful design to get this right.
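For context on the dynamic dispatch point, here is a minimal contrast between the two styles. Both functions are hypothetical and only mirror the shape of the problem (a pixel iterator behind `&mut dyn Iterator` versus a generic parameter); they are not taken from `display-interface` or from the draft PR.

```rust
/// Dynamic dispatch: every `next()` is an indirect call through a vtable,
/// so the compiler generally cannot inline the pixel source into the loop.
fn fill_dyn(pixels: &mut dyn Iterator<Item = u16>, out: &mut Vec<u8>) {
    for px in pixels {
        out.extend_from_slice(&px.to_be_bytes());
    }
}

/// Static dispatch: one copy of the loop is generated per iterator type,
/// which lets the optimizer see through `next()`.
fn fill_generic<I: Iterator<Item = u16>>(pixels: I, out: &mut Vec<u8>) {
    for px in pixels {
        out.extend_from_slice(&px.to_be_bytes());
    }
}

fn main() {
    let mut a = Vec::new();
    let mut b = Vec::new();
    let mut src = std::iter::repeat(0x07E0u16).take(4);
    fill_dyn(&mut src, &mut a);
    fill_generic(std::iter::repeat(0x07E0u16).take(4), &mut b);
    // Same bytes either way; only the generated code differs.
    assert_eq!(a, b);
    println!("{} bytes", a.len());
}
```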
> `embedded-hal` not providing a native iterator SPI `write/transfer/read` methods
>
> I think if `embedded-hal` actually had iterator support, and if we figured out a better abstraction for `display-interface` we should be able to get a fairly quick write using an iterator with 0 additional buffering?
This might help for some drawing operations, but I'm not sure it is needed. Randomly drawing individual pixels over a relatively slow connection like SPI always comes with a lot of overhead. For images we shouldn't use iterators at all.
I'm planning to add something like `draw_image(&mut self, image: &ImageRaw<...>, position: Point)` to e-g's `DrawTarget` trait, which will allow `DrawTarget` implementations to use a more efficient code path in that case. And because `ImageRaw` is just a thin wrapper around a byte slice that specifies the image dimensions, color format, and byte order, this could also be used for other applications like writing an entire RAM framebuffer to a display. I'm still not sure about the details, but ideally this addition to e-g and the changes to `display-interface` would make it possible to use DMA transfers for displaying images in more cases. And this might also be a stepping stone towards async support.
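One way to picture that proposal is a trait method with a slow default and an optional fast override. Everything below is hypothetical (`RawImage`, `Target`, `FakeDisplay` and `draw_image_slow` are invented for this sketch and are not embedded-graphics or mipidsi APIs); it assumes RGB565 data and a non-zero image width.

```rust
/// Hypothetical raw image: a byte slice plus the metadata needed to read it.
struct RawImage<'a> {
    data: &'a [u8], // 2 bytes per RGB565 pixel
    width: usize,   // must be non-zero
    big_endian: bool,
}

/// Slow path: decode and draw pixel by pixel.
fn draw_image_slow<T: Target + ?Sized>(t: &mut T, img: &RawImage, x0: usize, y0: usize) {
    for (i, px) in img.data.chunks_exact(2).enumerate() {
        let raw = [px[0], px[1]];
        let color = if img.big_endian {
            u16::from_be_bytes(raw)
        } else {
            u16::from_le_bytes(raw)
        };
        t.draw_pixel(x0 + i % img.width, y0 + i / img.width, color);
    }
}

trait Target {
    fn draw_pixel(&mut self, x: usize, y: usize, color: u16);

    /// Default works everywhere; drivers may override it with a faster path.
    fn draw_image(&mut self, img: &RawImage, x0: usize, y0: usize) {
        draw_image_slow(self, img, x0, y0);
    }
}

/// Stand-in for a display driver; `wire` records the bytes it would send.
struct FakeDisplay {
    wire: Vec<u8>,
}

impl Target for FakeDisplay {
    fn draw_pixel(&mut self, _x: usize, _y: usize, color: u16) {
        self.wire.extend_from_slice(&color.to_be_bytes());
    }

    fn draw_image(&mut self, img: &RawImage, x0: usize, y0: usize) {
        if img.big_endian {
            // Fast path: the bytes already match the wire format.
            // (A real driver would set the address window from x0/y0 first.)
            self.wire.extend_from_slice(img.data);
        } else {
            draw_image_slow(self, img, x0, y0);
        }
    }
}

fn main() {
    let mut d = FakeDisplay { wire: Vec::new() };
    let img = RawImage { data: &[0xF8, 0x00, 0x07, 0xE0], width: 2, big_endian: true };
    d.draw_image(&img, 0, 0);
    println!("{:02X?}", d.wire);
}
```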
One thing that `mipidsi` can do here is to introduce something more than just `draw_pixels` to the model.
If we allowed something like `write_pixels_raw_u8()` with `&[u8]` and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is ready for the display directly.
That could then be used by the expanded `e-g` as well as internally where possible. I think all it'd take really is to add a blanket implementation into `Model` here as a first step on the `mipidsi` side.
If you see my luluu repo, I went through the process of getting within a few percent of the theoretical bandwidth limits of the raspi 2040 spi device. There is some interesting stuff in the main firmware but in particular see the patches in the vendored crates, including the rp-hal implementation and mipidsi.
> If we allowed something like write_pixels_raw_u8() with &[u8] and exposed or hint-switch it somehow it'd allow to bypass the 16LE/BE shenanigans if the user knows that the data is ready for the display directly.

I believe I did something very similar to what you're describing here.
It's not clear to me how much these "fixes" can or should actually be upstreamed but maybe it's useful for reference https://github.com/fu5ha/luluu/tree/main/software
@rfuest Lmk what you think about https://github.com/almindor/mipidsi/pull/143
@fu5ha Thanks I'll have a look!
@akx could you please try the `Display::set_pixels_from_buffer()` (need to use latest master) to see how much of a difference it makes?
@almindor Sorry, I was away from this project for a while 😅
Not a great improvement (current mess of a code here – curiously enough my particular display doesn't seem to care if I pass in 319 or 320 as the "width" coordinate):
$ cargo espflash flash --monitor -p /dev/cu.usbmodem1101 --release
[2024-11-03T16:59:22Z INFO ] 🚀 A new version of cargo-espflash is available: v3.2.0
[2024-11-03T16:59:22Z INFO ] Serial port: '/dev/cu.usbmodem1101'
[2024-11-03T16:59:22Z INFO ] Connecting...
[2024-11-03T16:59:22Z INFO ] Using flash stub
Compiling telsu v0.1.0 (/Users/akx/build/telsu)
Finished `release` profile [optimized] target(s) in 1.39s
Chip type: esp32c6 (revision v0.1)
Crystal frequency: 40 MHz
Flash size: 8MB
Features: WiFi 6, BT 5
MAC address: f0:f5:bd:01:64:8c
Bootloader: /Users/akx/build/telsu/target/riscv32imac-esp-espidf/release/build/esp-idf-sys-86f39b5293325c54/out/build/bootloader/bootloader.bin
Partition table: part.csv
App/part. size: 462,960/4,194,304 bytes, 11.04%
[2024-11-03T16:59:24Z INFO ] Segment at address '0x0' has not changed, skipping write
[2024-11-03T16:59:24Z INFO ] Segment at address '0x8000' has not changed, skipping write
[00:00:03] [========================================] 250/250 0x10000 [2024-11-03T16:59:28Z INFO ] Flashing has completed!
Commands:
CTRL+R Reset chip
CTRL+C Exit
ESP-ROM:esp32c6-20220919
Build:Sep 19 2022
rst:0x15 (USB_UART_HPSYS),boot:0xc (SPI_FAST_FLASH_BOOT)
Saved PC:0x4080053c
0x4080053c - rmt_driver_isr_default
at ??:??
SPIWP:0xee
mode:DIO, clock div:2
load:0x40875720,len:0x1804
load:0x4086c410,len:0xe2c
load:0x4086e610,len:0x2e24
entry 0x4086c41a
I (23) boot: ESP-IDF v5.2.2 2nd stage bootloader
I (23) boot: compile time Nov 3 2024 18:50:31
I (24) boot: chip revision: v0.1
I (26) boot.esp32c6: SPI Speed : 80MHz
I (30) boot.esp32c6: SPI Mode : DIO
I (35) boot.esp32c6: SPI Flash Size : 8MB
I (40) boot: Enabling RNG early entropy source...
I (45) boot: Partition Table:
I (49) boot: ## Label Usage Type ST Offset Length
I (56) boot: 0 nvs WiFi data 01 02 00009000 00006000
I (64) boot: 1 phy_init RF data 01 01 0000f000 00001000
I (71) boot: 2 factory factory app 00 00 00010000 00400000
I (79) boot: End of partition table
I (83) esp_image: segment 0: paddr=00010020 vaddr=42000020 size=4c05ch (311388) map
I (156) esp_image: segment 1: paddr=0005c084 vaddr=40800000 size=03f94h ( 16276) load
I (160) esp_image: segment 2: paddr=00060020 vaddr=42050020 size=175d8h ( 95704) map
I (181) esp_image: segment 3: paddr=00077600 vaddr=40803f94 size=08070h ( 32880) load
I (190) esp_image: segment 4: paddr=0007f678 vaddr=4080c010 size=019d4h ( 6612) load
I (195) boot: Loaded app from partition at offset 0x10000
I (196) boot: Disabling RNG early entropy source...
I (210) cpu_start: Unicore app
W (218) clk: esp_perip_clk_init() has not been implemented yet
I (225) cpu_start: Pro cpu start user code
I (225) cpu_start: cpu freq: 160000000 Hz
I (226) cpu_start: Application information:
I (228) cpu_start: Project name: libespidf
I (233) cpu_start: App version: a2f4705
I (238) cpu_start: Compile time: Nov 3 2024 18:50:25
I (244) cpu_start: ELF file SHA256: 000000000...
I (250) cpu_start: ESP-IDF: v5.2.2
I (254) cpu_start: Min chip rev: v0.0
I (259) cpu_start: Max chip rev: v0.99
I (264) cpu_start: Chip rev: v0.1
I (269) heap_init: Initializing. RAM available for dynamic allocation:
I (276) heap_init: At 4080ED80 len 0006D890 (438 KiB): RAM
I (282) heap_init: At 4087C610 len 00002F54 (11 KiB): RAM
I (288) heap_init: At 50000000 len 00003FE8 (15 KiB): RTCRAM
I (295) spi_flash: detected chip: generic
I (299) spi_flash: flash io: dio
W (303) rmt(legacy): legacy driver is deprecated, please migrate to `driver/rmt_tx.h` and/or `driver/rmt_rx.h`
W (314) pcnt(legacy): legacy driver is deprecated, please migrate to `driver/pulse_cnt.h`
W (323) i2c: This driver is an old driver, please migrate your application code to adapt `driver/i2c_master.h`
W (333) timer_group: legacy driver is deprecated, please migrate to `driver/gptimer.h`
I (342) sleep: Configure to isolate all GPIO pins in sleep state
I (349) sleep: Enable automatic switching of GPIO sleep configuration
I (356) coexist: coex firmware version: d96c1e51f
I (361) coexist: coexist rom version 5b8dcfa
I (367) main_task: Started on CPU0
I (367) main_task: Calling app_main()
I (367) telsu: Hello, world!
I (377) telsu: Initializing LED
I (377) telsu: Initializing pins
I (377) gpio: GPIO[23]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
I (387) gpio: GPIO[22]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
I (397) gpio: GPIO[21]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
I (407) telsu: Initializing SPI device
I (417) telsu: Initializing display_interface_spi
I (417) telsu: Initializing display
I (727) telsu: Setting BL
I (767) telsu: Allocated buffer in 48us, filled buffer in 41370us
I (1367) telsu: Tick: 6, FPS: 13.89 / render=123us blit=72075us frame=72202us
I (1877) telsu: Tick: 13, FPS: 13.89 / render=121us blit=72078us frame=72203us
I (2387) telsu: Tick: 20, FPS: 13.89 / render=121us blit=72081us frame=72206us
I (2887) telsu: Tick: 27, FPS: 13.89 / render=122us blit=72077us frame=72203us
I (3397) telsu: Tick: 34, FPS: 13.89 / render=122us blit=72077us frame=72203us
I (3907) telsu: Tick: 41, FPS: 13.89 / render=121us blit=72082us frame=72207us
I (4407) telsu: Tick: 48, FPS: 13.89 / render=121us blit=72078us frame=72203us
I (4917) telsu: Tick: 55, FPS: 13.89 / render=124us blit=72075us frame=72203us
I (5417) telsu: Tick: 62, FPS: 13.89 / render=125us blit=72074us frame=72203us
I (5927) telsu: Tick: 69, FPS: 13.89 / render=124us blit=72075us frame=72203us
I (6437) telsu: Tick: 76, FPS: 13.89 / render=123us blit=72080us frame=72207us
Rendering 320x120 is exactly twice as fast:
I (1347) telsu: Tick: 13, FPS: 27.78 / render=123us blit=36107us frame=36234us
I (1857) telsu: Tick: 27, FPS: 27.78 / render=125us blit=36118us frame=36247us
I (2367) telsu: Tick: 41, FPS: 27.78 / render=121us blit=36109us frame=36234us
I (2877) telsu: Tick: 55, FPS: 27.78 / render=120us blit=36114us frame=36238us
Rendering 160x120 is exactly twice as fast as that:
I (1347) telsu: Tick: 27, FPS: 55.56 / render=121us blit=18142us frame=18267us
I (1857) telsu: Tick: 55, FPS: 55.56 / render=119us blit=18133us frame=18256us
I (2367) telsu: Tick: 83, FPS: 55.56 / render=118us blit=18130us frame=18252us
A similar program in C, using Espressif's own SDK and SPI drivers etc. pushes 65 FPS:
Frame: 9, Avg frame time: 15097 usec; FPS: 66
Frame: 19, Avg frame time: 15204 usec; FPS: 65
Frame: 29, Avg frame time: 15203 usec; FPS: 65
Frame: 39, Avg frame time: 15203 usec; FPS: 65
Frame: 49, Avg frame time: 15203 usec; FPS: 65
> Not a great improvement

That's unfortunate. I'd expect a much better improvement. I got myself an esp32-C6, once I have some time I'll try to experiment myself and see where the holdups are.
Ok so for me, using the `esp32-C6` with SPI set to `80Mhz` and a `320x240` `ST7789` display with CS:

- `~30ms` fullscreen draw time when using the new `Display::set_pixels_from_buffer` so roughly 33FPS, still slow but a bit better.
- `fill_solid` drops down to `~70ms` so worse than twice slowdown!
- `fill_contiguous` I wasn't able to create a full framebuffer with `Rgb565`, but using a `.copied().cycle()` iterator it's even slightly worse than `fill_solid` at about `~72ms`
So at least on the C6 this clearly shows that the buffering/abstraction we currently depend on is a big problem, but not the only issue.
I'm going to try and see if I can get closer to the C speed (pun intended) by removing the 64 byte buffer problem in the `display-interface` directly next.
When using `set_pixels_from_buffer` with no "mid-re-buffering", we should be able to get the full ~60 FPS speeds expected, but I don't know where the additional slowdown could be. At this point it should really just be buffer sends.
I did some more testing and if I deconstruct the SPI back after sending all the init stuff and prepping a data window, I'm able to send the raw buffer directly using the SPI, without any `display-interface` involvement, but it still clocks at exactly the same speed as the `set_pixels_from_buffer`.
This means that the SPI implementation of the `esp32-hal` (I'm not using IDF) is probably sub-optimal somehow.
I also identified that `esp-hal` will default to `80Mhz` for CPU clock speed, after fixing that I got overall improvement but still not good enough.
Full code here
NOTE: you need `prep_pixels_from_buffer` added to `mipidsi` for this to work, the code for that is also in the GIST at the bottom. It's just a hack to get this going for testing, the basic idea is to see if raw SPI direct u8 buffer transfer is the culprit here, which it seems to be.
My results are:
INFO - running at 160000000 Hz
INFO - Spi created
INFO - SpiDevice created
INFO - DI created
INFO - Display created
INFO - solid drawing time 44875 us
INFO - from_buffer drawing time 24692 us
INFO - direct drawing time 24666 us
I'm wondering if:
printf("Initializing SPI bus!\n");
ESP_ERROR_CHECK(spi_bus_initialize(
LCD_HOST, &buscfg, SPI_DMA_CH_AUTO)); // Enable the DMA feature
This might explain the discrepancy here. @rfuest what do you think? I'm guessing perhaps the DMA feature isn't done when using Rust code? I found a reference to SpiDmaBus but I'm unsure how to actually instantiate this. I suspect it's what's used in the C code with their `DMA_CH_AUTO` setup.
@rfuest @akx Found it! Note I used only `esp-hal`, not the IDF/std abstraction.
The main issues are:

- `Display::set_pixels_from_buffer` is crucial due to what we discussed originally

The code for the "fixed up" version is here
The result now looks like this:
INFO - solid drawing time 44062 us
INFO - from_buffer drawing time 16103 us
INFO - direct drawing time 16050 us
Which means ~62.5 FPS.
I'll keep this issue open until we release `mipidsi` with the new `set_pixels_from_buffer` method. The `display-interface` situation still needs thinking over as well, of course, as it does cause a major slowdown.
Heya,
I'm wondering how to speed things up for an application that will likely need full-screen updates most of the time.
I have an ESP32-C6-DevKitC-1 and a WaveShare 280x240 1.69" LCD module, using `esp-idf-svc` (at present anyway).

My current experiment code (please excuse the mess, it's an experiment so far) is

and the interesting (performance) output for a `--release` build is:

IOW, 99.2% of the frame time is spent in `fill_contiguous`. Setting the SPI baudrate to something lower than 80 MHz (which, AIUI, is already pushing it, especially given my display is behind 10-centimeter DuPont wires 🤠) doesn't change things a lot; blit time becomes about 123547us.

Is there something obvious I'm doing wrong for an application like this, where I basically just have a buffer of Rgb565 to push to the screen?
And of course thank you for the work you've put into the library and the ecosystem at large! I was surprised to see things working at all (after, of course, having heeded the big red instructions on WaveShare's wiki and powered the display from 3V3 and not 5V...).