almindor / mipidsi

MIPI Display Serial Interface unified driver
MIT License

How to speed things up (ST7789V2 + ESP32-C6-DevKitC-1) #142

Open akx opened 4 days ago

akx commented 4 days ago

Heya,

I'm wondering how to speed things up for an application that will likely need full-screen updates most of the time.

I have an ESP32-C6-DevKitC-1 and a WaveShare 280x240 1.69" LCD module, using esp-idf-svc (at present anyway).

My current experiment code (please excuse the mess, it's an experiment so far) is

use std::collections::VecDeque;
use std::time::Instant;
use esp_idf_svc::hal::gpio::PinDriver;
use esp_idf_svc::hal::prelude::*;
use crate::led::WS2812RMT;

use embedded_graphics::{
    prelude::*,
    primitives::Rectangle,
};
use embedded_hal::spi::MODE_3;
use esp_idf_svc::hal::delay::Ets;
use esp_idf_svc::hal::spi::config::{BitOrder, Config, Duplex};
use esp_idf_svc::hal::spi::{SpiDeviceDriver, SpiDriverConfig};
use display_interface_spi::SPIInterface;
use embedded_graphics::pixelcolor::{Rgb565, Rgb888};
use mipidsi::models::ST7789;
use mipidsi::options::{ColorInversion, Orientation, RefreshOrder, Rotation};
use rand::{Rng, SeedableRng};
use rand::rngs::SmallRng;
use crate::ticktock::Ticktock;

mod led;
mod ticktock;

fn main() -> anyhow::Result<()> {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    log::info!("Hello, world!");
    let peripherals = Peripherals::take()?;
    log::info!("Initializing LED");
    let mut led = WS2812RMT::new(peripherals.pins.gpio8, peripherals.rmt.channel0)?;
    led.set_pixel(50, 50, 0)?;
    let mut rng = SmallRng::from_entropy();
    log::info!("Initializing pins");
    let spi = peripherals.spi2;
    let mut backlight = PinDriver::output(peripherals.pins.gpio23)?;
    let rst = PinDriver::output(peripherals.pins.gpio22)?;
    let dc = PinDriver::output(peripherals.pins.gpio21)?; // DC (data/command)
    let sclk = peripherals.pins.gpio19; // CLK
    let sda = peripherals.pins.gpio18; // DIN
    let sdi = peripherals.pins.gpio15; // Tied low
    let cs = peripherals.pins.gpio20; // CS
    log::info!("Initializing SPI device");
    let mut delay = Ets;
    let config = Config::new()
        .bit_order(BitOrder::MsbFirst)
        .data_mode(MODE_3)
        .baudrate(80.MHz().into());

    let device = SpiDeviceDriver::new_single(
        spi,
        sclk,
        sda,
        Some(sdi),
        Some(cs),
        &SpiDriverConfig::new(),
        &config,
    )?;
    log::info!("Initializing display_interface_spi");
    let di = SPIInterface::new(device, dc);
    log::info!("Initializing display");
    let mut display = mipidsi::Builder::new(ST7789, di)
        .reset_pin(rst)
        .display_size(240, 320)
        .orientation(Orientation::new().rotate(Rotation::Deg90))
        .invert_colors(ColorInversion::Inverted)
        .init(&mut delay).unwrap();
    log::info!("Setting BL");
    backlight.set_high()?;

    let mut sw = Ticktock::new();
    let mut buf: Vec<Rgb565> = Vec::with_capacity(320*240);
    let alloc_time = sw.tick();
    for i in 0..240 {
        for j in 0..320u16 {
            let val = i ^ (j as u8);
            let px = Rgb888::new(val, val, val);
            buf.push(Rgb565::from(px));
        }
    }
    let initial_fill_time = sw.tick();
    log::info!("Allocated buffer in {alloc_time}us, filled buffer in {initial_fill_time}us",
        alloc_time = alloc_time,
        initial_fill_time = initial_fill_time,
    );

    let mut frame_times: VecDeque<u32> = VecDeque::with_capacity(20);
    let fullscreen = Rectangle::new(Point::new(0,0), Size::new(320,240));
    let mut last_fps_print_time = Instant::now();
    for _i in 1..10000 {
        let mut sw = Ticktock::new();
        for _ in 1..100 {
            let inoff = rng.gen_range(0..buf.len());
            let outoff = rng.gen_range(0..buf.len());
            buf.swap(inoff, outoff);
        }
        let render_time = sw.tick();
        let clon = buf.clone();
        let clone_time = sw.tick();
        display.fill_contiguous(&fullscreen, clon).unwrap();
        let blit_time = sw.tick();
        let frame_time = sw.total();
        frame_times.push_back((frame_time / 1000) as u32);
        if frame_times.len() > 20 {
            frame_times.pop_front();
        }
        if last_fps_print_time.elapsed().as_millis() > 500 {
            led.set_pixel(0, 50, 0)?;
            let fps = 1000.0 / (frame_times.iter().sum::<u32>() as f32 / (frame_times.len() as f32));
            log::info!(
                "FPS: {fps:.2} / render={render_time}us clone={clone_time}us blit={blit_time}us frame={frame_time}us",
                fps = fps,
                render_time = render_time,
                clone_time = clone_time,
                blit_time = blit_time,
                frame_time = frame_time,
            );
            last_fps_print_time = Instant::now();
            led.set_pixel(0, 0, 50)?;
        }
    }
    unreachable!();
}

and the interesting (performance) output for a --release build is:

I (742) telsu: Allocated buffer in 35us, filled buffer in 25514us
I (1292) telsu: FPS: 9.26 / render=120us clone=649us blit=107889us frame=108662us
I (1832) telsu: FPS: 9.26 / render=122us clone=649us blit=107897us frame=108672us

IOW, 99.2% of the frame time is spent in fill_contiguous. Setting the SPI baudrate to something lower than 80 MHz (which, AIUI, is already pushing it, especially given my display is behind 10-centimeter DuPont wires 🤠) doesn't change things a lot; blit time becomes about 123547us.

Is there something obvious I'm doing wrong for an application like this, where I basically just have a buffer of Rgb565 to push to the screen?

And of course thank you for the work you've put into the library and the ecosystem at large! I was surprised to see things working at all (after, of course, having heeded the big red instructions on WaveShare's wiki and powered the display from 3V3 and not 5V...).

almindor commented 4 days ago

Hmm, I don't see anything particularly wrong in the code, but AFAICS you should be able to push roughly 60 fps with 80 MHz SPI. Of course in practice that's going to be a bit lower (and I only did my napkin math real quick, so I might've made a mistake here).

For one full screen of fill you're pushing 320x240x2x8 bits, so 1_228_800 bits per frame; at 80 MHz that should fit roughly 65 of those a second.
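
Spelled out, that napkin math looks like this (a quick sanity check that ignores all command, driver, and protocol overhead):

const WIDTH: u32 = 320;
const HEIGHT: u32 = 240;
const BITS_PER_PIXEL: u32 = 16; // Rgb565
const SPI_HZ: u32 = 80_000_000;

fn main() {
    // 320 * 240 * 16 = 1_228_800 bits per full frame
    let bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL;
    // 80_000_000 / 1_228_800 ≈ 65 frames per second, best case
    let max_fps = SPI_HZ as f32 / bits_per_frame as f32;
    println!("{bits_per_frame} bits/frame -> at most {max_fps:.1} fps");
}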

One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case that's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be that you're just running on lower clocks on your MCU.

rfuest commented 4 days ago

The display-interface crate and embedded-hal 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new SpiDevice trait, display-interface-spi buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in display-interface, the performance suffers noticeably for fast displays.
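
Roughly, the buffered path looks like the sketch below (a simplified illustration of the pattern, not the actual display-interface-spi source; the buffer size and flush handling are assumptions):

use embedded_hal::spi::SpiDevice;

// Drain a pixel iterator through a small intermediate buffer. Every
// `spi.write()` is its own SpiDevice transaction, so CS is asserted and
// released once per 64-byte chunk, which adds up on fast displays.
fn send_u16_be<SPI: SpiDevice>(
    spi: &mut SPI,
    pixels: impl Iterator<Item = u16>,
) -> Result<(), SPI::Error> {
    let mut buf = [0u8; 64];
    let mut used = 0;
    for px in pixels {
        // Endian flip: the panel expects big-endian 16-bit pixels.
        buf[used..used + 2].copy_from_slice(&px.to_be_bytes());
        used += 2;
        if used == buf.len() {
            spi.write(&buf)?; // one CS toggle per chunk
            used = 0;
        }
    }
    if used > 0 {
        spi.write(&buf[..used])?; // flush the remainder
    }
    Ok(())
}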

almindor commented 4 days ago

The display-interface crate and embedded-hal 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new SpiDevice trait, display-interface-spi buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in display-interface, the performance suffers noticeably for fast displays.

Hmm that's quite unfortunate. Do you think we should abandon display-interface use? I'm also curious why 64 byte chunks are necessary? I don't remember that being any kind of requirement when we discussed the e-h 1.0 SPI traits.

akx commented 3 days ago

Thank you for the response and thoughts.

One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case that's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be that you're just running on lower clocks on your MCU.

The full log (which of course I should've posted in the first place, but here it is) says

I (220) cpu_start: Pro cpu start user code
I (221) cpu_start: cpu freq: 160000000 Hz

so unless something in the ESP IDF drops that down after the app starts, I think I should be running at 160MHz.

akx commented 3 days ago

I'm also curious why 64 byte chunks are necessary?

That seems to be this bit of code.

Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.

EDIT: Changing that buffer size in a local copy of display-interface-spi:

buffer size   result
64            FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us
128           FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us
256           FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us
512           FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us
1024          FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us
3840          Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄

almindor commented 3 days ago

I'm also curious why 64 byte chunks are necessary?

That seems to be this bit of code.

Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.

EDIT: Changing that buffer size in a local copy of display-interface-spi:

buffer size   result
64            FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us
128           FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us
256           FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us
512           FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us
1024          FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us
3840          Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄

Perfect, I was hoping to see how this goes. It seems the flip-flopping of the buffer isn't the main slowdown here. Making the buffer bigger improves it, but not in a linear fashion at all, considering the 1024 case should've been 16 times better in theory and is a < 10% improvement.

Could you try changing fill_contiguous into a fill_solid, or possibly calling set_pixels directly? I just want to make sure it's not something funky.

rfuest commented 2 days ago

I did some testing and I think the ESP HAL has quite a bit of overhead for the kind of transfers we use. I'm currently testing on an ESP32-C3 with the no_std HAL. With some tweaks I was able to get the time to clear the framebuffer down to ~60ms, which still isn't great.

Using a logic analyzer did reveal some interesting patterns:

[logic analyzer capture]

For this capture I've used a display-interface-spi buffer size of 256 bytes and an SPI clock of 40 MHz (I did set 60 MHz in my program, but I measured 40 MHz). The HAL splits up every 256 byte buffer into four 64 byte blocks, because that is the SPI FIFO size (https://docs.esp-rs.org/esp-hal/esp-hal/0.21.1/esp32c3/esp_hal/spi/master/struct.Spi.html#method.write_bytes). While the 64 byte sub-blocks are transferred at full SPI clock speed without any gaps between bytes, the gaps between the 64 byte blocks cause significant overhead. This alone causes the write performance to drop by ~33%.

I'm not familiar enough with the ESP32-C3 internals to know if this dead time could be reduced. The "best" solution would probably be a framebuffer in RAM, which is then transferred via DMA to the display, but that does require a lot of RAM and there is currently no support for DMA in this crate or the embedded-graphics ecosystem in general.

akx commented 2 days ago

Could you try changing fill_contiguous into a fill_solid, or possibly calling set_pixels directly? I just want to make sure it's not something funky.

I changed the draw bit to

display.fill_solid(&fullscreen, if i % 2 == 0 { Rgb565::BLACK } else { Rgb565::WHITE }).unwrap();
I (1347) telsu: FPS: 10.10 / render=125us clone=4us blit=99427us frame=99560us
I (1947) telsu: FPS: 10.10 / render=121us clone=4us blit=99424us frame=99552us
I (2547) telsu: FPS: 10.10 / render=125us clone=4us blit=99425us frame=99558us
I (3147) telsu: FPS: 10.10 / render=124us clone=4us blit=99426us frame=99558us

which is still kind of unfortunate (and funky, I think?)...


For what it's worth, changing ST7789::write_pixels to use U16LEIter instead of U16BEIter (i.e. so the buffer doesn't need to be byte-swapped, though naturally the data sent to the display is then pretty bogus unless treated elsewhere) takes blit=100382us down to blit=98454us for the original test case.
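
One way to keep that swap out of the per-frame path is to hold the framebuffer in the panel's byte order from the start. A minimal sketch of pre-swapping an Rgb565 buffer into big-endian u16 words (assuming embedded-graphics' IntoStorage; how those words then reach the bus is a separate question):

use embedded_graphics::pixelcolor::Rgb565;
use embedded_graphics::prelude::*; // IntoStorage

// Swap each pixel into the panel's byte order once, ahead of time, so the
// per-frame transfer can push the words (or their bytes) out unchanged.
fn to_display_order(pixels: &[Rgb565]) -> Vec<u16> {
    pixels
        .iter()
        .map(|px| px.into_storage().to_be()) // RGB565 value, big-endian on the wire
        .collect()
}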

rfuest commented 2 days ago

I've worked on this a bit more and got the update reasonably fast by using DMA and bypassing the iterator. My implementation uses two DMA buffers, which makes it possible to fill one while the other is being transmitted. There are still gaps between the individual blocks, but the overall impact of this overhead is now fairly low.

This really shows that using a dynamically dispatched iterator to transfer lots of data is a bad idea. Even with two DMA buffers the time it takes to fill one buffer up is much longer than the time it takes to transmit a buffer. This causes the SPI bus to be idle about 80% of the time, which is why it still takes 70ms to update the display at an 80 MHz SPI clock.

But if we bypass the iterator and use DataFormat::U8 to transmit the data to the display the transfer is much faster. Filling up the new buffer now only takes a fraction of the time it takes to transmit one buffer. With 95% SPI bus utilization the display now gets updated in ~16ms.
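
In outline, the double-buffering looks like the sketch below; start_transfer and wait_done are hypothetical stand-ins for whatever DMA primitives the HAL actually provides, and the 4096-byte buffer size is arbitrary:

// Ping-pong buffering: the CPU fills one buffer while the DMA engine drains
// the other, then the roles swap. `start_transfer` and `wait_done` stand in
// for the HAL's real DMA primitives.
fn blit_ping_pong(
    frame: &[u16],
    bufs: &mut [[u8; 4096]; 2],
    mut start_transfer: impl FnMut(&[u8]),
    mut wait_done: impl FnMut(),
) {
    let pixels_per_buf = bufs[0].len() / 2;
    let mut active = 0;
    for chunk in frame.chunks(pixels_per_buf) {
        // Fill the currently idle buffer with big-endian pixel bytes.
        for (i, px) in chunk.iter().enumerate() {
            bufs[active][i * 2..i * 2 + 2].copy_from_slice(&px.to_be_bytes());
        }
        wait_done(); // make sure the previous transfer has finished
        start_transfer(&bufs[active][..chunk.len() * 2]);
        active ^= 1; // next iteration fills the other buffer
    }
    wait_done(); // wait for the last transfer before returning
}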

I've uploaded my custom display-interface in this GIST. The draw_fast method is a bit of a hack, but has the largest impact on performance.

akx commented 1 day ago

@rfuest Nice work! Am I understanding correctly that since the Framebuffer is marked BigEndian, the price for converting from native (little-endian) pixel formats to big-endian is paid when drawing to the buffer instead of when transferring, and since draw_fast just does a "trust me bro, these bytes are what we want to push out to the bus", that works out? 😄

rfuest commented 1 day ago

Yes, that's correct. Some API changes will be necessary to make this safer to use without the chance of accidentally using the wrong endianness.

In my code the CPU is often just waiting for the DMA transfer to finish and converting a little-endian framebuffer into a big-endian format during that time wouldn't negatively impact the time it takes to update the display. But the display-interface crate doesn't provide a way to specify the endianness of a &[u8] buffer. This hack works and is as fast as before, but it is on a whole other level of "trust me bro".

almindor commented 1 day ago

So from what I understand there are a few issues here:

  • display-interface forcing a 64 byte buffer size
  • display-interface forcing endianness runtime conversion
  • display-interface using dynamic dispatch

Optionally I think another issue is that:

  • embedded-hal not providing native iterator SPI write/transfer/read methods

I think if embedded-hal actually had iterator support, and if we figured out a better abstraction for display-interface we should be able to get a fairly quick write using an iterator with 0 additional buffering?

rfuest commented 23 hours ago

  • display-interface forcing a 64 byte buffer size

display-interface doesn't force any buffer size, but the implementation for SPI in display-interface-spi does use a 64 word buffer to convert the iterator based DataFormats into a buffer of bytes that can be passed to the SpiDevice impl.

  • display-interface forcing endianness runtime conversion

No, it doesn't force runtime conversion, but it only supports different endianness for u16 buffers via the DataFormat::{U16BE, U16LE} variants. In my opinion it is inconvenient that you need to cast the more common &[u8] buffer to &[u16] first, which to my knowledge requires an external crate like byte-slice-cast (which the display-interface impls use internally).

  • display-interface using dynamic dispatch

This has been an issue for a long time and there is even an open draft PR from 2021 with a possible solution in the display-interface repo, but nobody seems to have cared about that: https://github.com/therealprof/display-interface/pull/19

I'm not sure if we (or someone) should fix display-interface, which seems to be more or less unmaintained, or if it is better to create a new alternative. In any case this will need some careful design to get this right.

  • embedded-hal not providing native iterator SPI write/transfer/read methods

I think if embedded-hal actually had iterator support, and if we figured out a better abstraction for display-interface we should be able to get a fairly quick write using an iterator with 0 additional buffering?

This might help for some drawing operations, but I'm not sure it is needed. Randomly drawing individual pixels over a relatively slow connection like SPI always comes with a lot of overhead. For images we shouldn't use iterators at all.

I'm planning to add something like draw_image(&mut self, image: &ImageRaw<...>, position: Point) to e-g's DrawTarget trait, which will allow DrawTarget implementations to use a more efficient code path in that case. And because ImageRaw is just a thin wrapper around a byte slice that specifies the image dimensions, color format, and byte order, this could also be used for other applications like writing an entire RAM framebuffer to a display. I'm still not sure about the details, but ideally this addition to e-g and the changes to display-interface would make it possible to use DMA transfers for displaying images in more cases. And this might also be a stepping stone towards async support.
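
Something close to that is already expressible with today's ImageRaw plus the Image adapter, although it still goes through the generic iterator path rather than a dedicated fast path or DMA (a sketch assuming embedded-graphics 0.8, with the byte order spelled out):

use embedded_graphics::{
    image::{Image, ImageRaw},
    pixelcolor::{raw::BigEndian, Rgb565},
    prelude::*,
};

// `frame_bytes` is assumed to hold 320x240 RGB565 pixels, two bytes each,
// already in big-endian byte order.
fn draw_frame<D>(display: &mut D, frame_bytes: &[u8]) -> Result<(), D::Error>
where
    D: DrawTarget<Color = Rgb565>,
{
    let raw: ImageRaw<Rgb565, BigEndian> = ImageRaw::new(frame_bytes, 320);
    Image::new(&raw, Point::zero()).draw(display)
}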

almindor commented 21 hours ago

One thing that mipidsi can do here is to introduce something more than just draw_pixels to the model.

If we allowed something like write_pixels_raw_u8() with &[u8] and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is already in the format the display expects.

That could then be used by the expanded e-g as well as internally where possible. I think all it'd really take is to add a blanket implementation into Model here as a first step on the mipidsi side.
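
As a rough illustration of the shape this could take (the trait name and placement are hypothetical, not the actual Model trait; the only real API used is display-interface's DataFormat::U8 pass-through):

use display_interface::{DataFormat, DisplayError, WriteOnlyDataCommand};

// Hypothetical sketch: a raw byte path alongside the existing pixel-iterator
// path. The default just forwards caller-prepared bytes to the display
// interface via DataFormat::U8, trusting that they are already in the byte
// order the panel expects.
trait WritePixelsRaw {
    fn write_pixels_raw_u8<DI: WriteOnlyDataCommand>(
        &mut self,
        di: &mut DI,
        data: &[u8],
    ) -> Result<(), DisplayError> {
        di.send_data(DataFormat::U8(data))
    }
}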

fu5ha commented 20 hours ago

If you look at my luluu repo, you can see I went through the process of getting within a few percent of the theoretical bandwidth limit of the RP2040 SPI device. There is some interesting stuff in the main firmware, but in particular see the patches in the vendored crates, including the rp-hal implementation and mipidsi.

If we allowed something like write_pixels_raw_u8() with &[u8] and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is already in the format the display expects.

I believe I did something very similar to what you're describing here.

It's not clear to me how much these "fixes" can or should actually be upstreamed but maybe it's useful for reference https://github.com/fu5ha/luluu/tree/main/software

almindor commented 20 hours ago

@rfuest Lmk what you think about https://github.com/almindor/mipidsi/pull/143

@fu5ha Thanks I'll have a look!