almindor / mipidsi

MIPI Display Serial Interface unified driver
MIT License

How to speed things up (ST7789V2 + ESP32-C6-DevKitC-1) #142

Open akx opened 4 days ago

akx commented 4 days ago

Heya,

I'm wondering how to speed things up for an application that will likely need full-screen updates most of the time.

I have an ESP32-C6-DevKitC-1 and a WaveShare 280x240 1.69" LCD module, using esp-idf-svc (at present anyway).

My current experiment code (please excuse the mess, it's an experiment so far) is

use std::collections::VecDeque;
use std::time::Instant;
use esp_idf_svc::hal::gpio::PinDriver;
use esp_idf_svc::hal::prelude::*;
use crate::led::WS2812RMT;

use embedded_graphics::{
    prelude::*,
    primitives::Rectangle,
};
use embedded_hal::spi::MODE_3;
use esp_idf_svc::hal::delay::Ets;
use esp_idf_svc::hal::spi::config::{BitOrder, Config, Duplex};
use esp_idf_svc::hal::spi::{SpiDeviceDriver, SpiDriverConfig};
use display_interface_spi::SPIInterface;
use embedded_graphics::pixelcolor::{Rgb565, Rgb888};
use mipidsi::models::ST7789;
use mipidsi::options::{ColorInversion, Orientation, RefreshOrder, Rotation};
use rand::{Rng, SeedableRng};
use rand::rngs::SmallRng;
use crate::ticktock::Ticktock;

mod led;
mod ticktock;

fn main() -> anyhow::Result<()> {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    log::info!("Hello, world!");
    let peripherals = Peripherals::take()?;
    log::info!("Initializing LED");
    let mut led = WS2812RMT::new(peripherals.pins.gpio8, peripherals.rmt.channel0)?;
    led.set_pixel(50, 50, 0)?;
    let mut rng = SmallRng::from_entropy();
    log::info!("Initializing pins");
    let spi = peripherals.spi2;
    let mut backlight = PinDriver::output(peripherals.pins.gpio23)?;
    let rst = PinDriver::output(peripherals.pins.gpio22)?;
    let dc = PinDriver::output(peripherals.pins.gpio21)?; // DC (data/command)
    let sclk = peripherals.pins.gpio19; // CLK
    let sda = peripherals.pins.gpio18; // DIN
    let sdi = peripherals.pins.gpio15; // Tied low
    let cs = peripherals.pins.gpio20; // CS
    log::info!("Initializing SPI device");
    let mut delay = Ets;
    let config = Config::new()
        .bit_order(BitOrder::MsbFirst)
        .data_mode(MODE_3)
        .baudrate(80.MHz().into());

    let device = SpiDeviceDriver::new_single(
        spi,
        sclk,
        sda,
        Some(sdi),
        Some(cs),
        &SpiDriverConfig::new(),
        &config,
    )?;
    log::info!("Initializing display_interface_spi");
    let di = SPIInterface::new(device, dc);
    log::info!("Initializing display");
    let mut display = mipidsi::Builder::new(ST7789, di)
        .reset_pin(rst)
        .display_size(240, 320)
        .orientation(Orientation::new().rotate(Rotation::Deg90))
        .invert_colors(ColorInversion::Inverted)
        .init(&mut delay).unwrap();
    log::info!("Setting BL");
    backlight.set_high()?;

    let mut sw = Ticktock::new();
    let mut buf: Vec<Rgb565> = Vec::with_capacity(320*240);
    let alloc_time = sw.tick();
    for i in 0..240 {
        for j in 0..320u16 {
            let val = i ^ (j as u8);
            let px = Rgb888::new(val, val, val);
            buf.push(Rgb565::from(px));
        }
    }
    let initial_fill_time = sw.tick();
    log::info!("Allocated buffer in {alloc_time}us, filled buffer in {initial_fill_time}us",
        alloc_time = alloc_time,
        initial_fill_time = initial_fill_time,
    );

    let mut frame_times: VecDeque<u32> = VecDeque::with_capacity(20);
    let fullscreen = Rectangle::new(Point::new(0,0), Size::new(320,240));
    let mut last_fps_print_time = Instant::now();
    for _i in 1..10000 {
        let mut sw = Ticktock::new();
        for _ in 1..100 {
            let inoff = rng.gen_range(0..buf.len());
            let outoff = rng.gen_range(0..buf.len());
            buf.swap(inoff, outoff);
        }
        let render_time = sw.tick();
        let clon = buf.clone();
        let clone_time = sw.tick();
        display.fill_contiguous(&fullscreen, clon).unwrap();
        let blit_time = sw.tick();
        let frame_time = sw.total();
        frame_times.push_back((frame_time / 1000) as u32);
        if frame_times.len() > 20 {
            frame_times.pop_front();
        }
        if last_fps_print_time.elapsed().as_millis() > 500 {
            led.set_pixel(0, 50, 0)?;
            let fps = 1000.0 / (frame_times.iter().sum::<u32>() as f32 / (frame_times.len() as f32));
            log::info!(
                "FPS: {fps:.2} / render={render_time}us clone={clone_time}us blit={blit_time}us frame={frame_time}us",
                fps = fps,
                render_time = render_time,
                clone_time = clone_time,
                blit_time = blit_time,
                frame_time = frame_time,
            );
            last_fps_print_time = Instant::now();
            led.set_pixel(0, 0, 50)?;
        }
    }
    unreachable!();
}

and the interesting (performance) output for a --release build is:

I (742) telsu: Allocated buffer in 35us, filled buffer in 25514us
I (1292) telsu: FPS: 9.26 / render=120us clone=649us blit=107889us frame=108662us
I (1832) telsu: FPS: 9.26 / render=122us clone=649us blit=107897us frame=108672us

IOW, 99.2% of the frame time is spent in fill_contiguous. Setting the SPI baudrate to something lower than 80 MHz (which, AIUI, is already pushing it, especially given my display is behind 10-centimeter DuPont wires 🤠) doesn't change things a lot; blit time becomes about 123547us.

Is there something obvious I'm doing wrong for an application like this, where I basically just have a buffer of Rgb565 to push to the screen?

And of course thank you for the work you've put into the library and the ecosystem at large! I was surprised to see things working at all (after, of course, having heeded the big red instructions on WaveShare's wiki and powered the display from 3V3 and not 5V...).

almindor commented 4 days ago

Hmm, I don't see anything particularly wrong in the code, but AFAICS you should be able to push roughly 60 fps with 80 MHz SPI. Of course in practice that's going to be a bit lower (and I only did my napkin math real quick, so I might've made a mistake here).

For one full screen of fill you're pushing 320x240x2x8 bits, so 1_228_800 bits per frame; at 80 MHz that should fit roughly 65 of those a second.
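
Spelled out, that napkin math looks like this (a quick sanity check that ignores all command, driver, and protocol overhead):

const WIDTH: u32 = 320;
const HEIGHT: u32 = 240;
const BITS_PER_PIXEL: u32 = 16; // Rgb565
const SPI_HZ: u32 = 80_000_000;

fn main() {
    // 320 * 240 * 16 = 1_228_800 bits per full frame
    let bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL;
    // 80_000_000 / 1_228_800 ≈ 65 frames per second, best case
    let max_fps = SPI_HZ as f32 / bits_per_frame as f32;
    println!("{bits_per_frame} bits/frame -> at most {max_fps:.1} fps");
}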

One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case that's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be that you're just running on lower clocks on your MCU.

rfuest commented 4 days ago

The display-interface crate and embedded-hal 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new SpiDevice trait, display-interface-spi buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in display-interface, the performance suffers noticeably for fast displays.
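
Roughly, the buffered path looks like the sketch below (a simplified illustration of the pattern, not the actual display-interface-spi source; the buffer size and flush handling are assumptions):

use embedded_hal::spi::SpiDevice;

// Drain a pixel iterator through a small intermediate buffer. Every
// `spi.write()` is its own SpiDevice transaction, so CS is asserted and
// released once per 64-byte chunk, which adds up on fast displays.
fn send_u16_be<SPI: SpiDevice>(
    spi: &mut SPI,
    pixels: impl Iterator<Item = u16>,
) -> Result<(), SPI::Error> {
    let mut buf = [0u8; 64];
    let mut used = 0;
    for px in pixels {
        // Endian flip: the panel expects big-endian 16-bit pixels.
        buf[used..used + 2].copy_from_slice(&px.to_be_bytes());
        used += 2;
        if used == buf.len() {
            spi.write(&buf)?; // one CS toggle per chunk
            used = 0;
        }
    }
    if used > 0 {
        spi.write(&buf[..used])?; // flush the remainder
    }
    Ok(())
}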

almindor commented 4 days ago

The display-interface crate and embedded-hal 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new SpiDevice trait, display-interface-spi buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in display-interface, the performance suffers noticeably for fast displays.

Hmm that's quite unfortunate. Do you think we should abandon display-interface use? I'm also curious why 64 byte chunks are necessary? I don't remember that being any kind of requirement when we discussed the e-h 1.0 SPI traits.

akx commented 3 days ago

Thank you for the response and thoughts.

One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case that's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be that you're just running on lower clocks on your MCU.

The full log (which of course I should've posted in the first place, but here it is) says

I (220) cpu_start: Pro cpu start user code
I (221) cpu_start: cpu freq: 160000000 Hz

so unless something in the ESP IDF drops that down after the app starts, I think I should be running at 160MHz.

akx commented 3 days ago

I'm also curious why 64 byte chunks are necessary?

That seems to be this bit of code.

Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.

EDIT: Changing that buffer size in a local copy of display-interface-spi:

buffer size   result
64            FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us
128           FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us
256           FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us
512           FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us
1024          FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us
3840          Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄

almindor commented 3 days ago

I'm also curious why 64 byte chunks are necessary?

That seems to be this bit of code.

Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.

EDIT: Changing that buffer size in a local copy of display-interface-spi:

buffer size   result
64            FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us
128           FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us
256           FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us
512           FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us
1024          FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us
3840          Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄

Perfect, I was hoping to see how this goes. It seems the flip-flopping of the buffer isn't the main slowdown here. Making the buffer bigger improves it, but not in a linear fashion at all, considering the 1024 case should've been 16 times better in theory and is a < 10% improvement.

Could you try changing fill_contiguous into a fill_solid, or possibly calling set_pixels directly? I just want to make sure it's not something funky.

rfuest commented 2 days ago

I did some testing and I think the ESP HAL has quite a bit of overhead for the kind of transfers we use. I'm currently testing on an ESP32-C3 with the no_std HAL. With some tweaks I was able to get the time to clear the framebuffer down to ~60ms, which still isn't great.

Using a logic analyzer did reveal some interesting patterns:

[logic analyzer capture]

For this capture I've used a display-interface-spi buffer size of 256 bytes and an SPI clock of 40 MHz (I did set 60 MHz in my program, but I measured 40 MHz). The HAL splits up every 256 byte buffer into four 64 byte blocks, because that is the SPI FIFO size (https://docs.esp-rs.org/esp-hal/esp-hal/0.21.1/esp32c3/esp_hal/spi/master/struct.Spi.html#method.write_bytes). While the 64 byte sub-blocks are transferred at full SPI clock speed without any gaps between bytes, the gaps between the 64 byte blocks cause significant overhead. This alone causes the write performance to drop by ~33%.

I'm not familiar enough with the ESP32-C3 internals to know if this dead time could be reduced. The "best" solution would probably be a framebuffer in RAM, which is then transferred via DMA to the display, but that does require a lot of RAM and there is currently no support for DMA in this crate or the embedded-graphics ecosystem in general.

akx commented 2 days ago

Could you try changing fill_contiguous into a fill_solid, or possibly calling set_pixels directly? I just want to make sure it's not something funky.

I changed the draw bit to

display.fill_solid(&fullscreen, if i % 2 == 0 { Rgb565::BLACK } else { Rgb565::WHITE }).unwrap();
I (1347) telsu: FPS: 10.10 / render=125us clone=4us blit=99427us frame=99560us
I (1947) telsu: FPS: 10.10 / render=121us clone=4us blit=99424us frame=99552us
I (2547) telsu: FPS: 10.10 / render=125us clone=4us blit=99425us frame=99558us
I (3147) telsu: FPS: 10.10 / render=124us clone=4us blit=99426us frame=99558us

which is still kind of unfortunate (and funky, I think?)...


For what it's worth, changing ST7789::write_pixels to use U16LEIter instead of U16BEIter (i.e. so the buffer doesn't need to be byte-swapped, though naturally the data sent to the display is then pretty bogus unless treated elsewhere) takes blit=100382us down to blit=98454us for the original test case.
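
One way to keep that swap out of the per-frame path is to hold the framebuffer in the panel's byte order from the start. A minimal sketch of pre-swapping an Rgb565 buffer into big-endian u16 words (assuming embedded-graphics' IntoStorage; how those words then reach the bus is a separate question):

use embedded_graphics::pixelcolor::Rgb565;
use embedded_graphics::prelude::*; // IntoStorage

// Swap each pixel into the panel's byte order once, ahead of time, so the
// per-frame transfer can push the words (or their bytes) out unchanged.
fn to_display_order(pixels: &[Rgb565]) -> Vec<u16> {
    pixels
        .iter()
        .map(|px| px.into_storage().to_be()) // RGB565 value, big-endian on the wire
        .collect()
}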

rfuest commented 2 days ago

I've worked on this a bit more and got the update reasonably fast by using DMA and bypassing the iterator. My implementation uses two DMA buffers, which makes it possible to fill one while the other is being transmitted. There are still gaps between the individual blocks, but the overall impact of this overhead is now fairly low.

This really shows that using a dynamically dispatched iterator to transfer lots of data is a bad idea. Even with two DMA buffers the time it takes to fill one buffer up is much longer than the time it takes to transmit a buffer. This causes the SPI bus to be idle about 80% of the time, which is why it still takes 70ms to update the display at an 80 MHz SPI clock.

But if we bypass the iterator and use DataFormat::U8 to transmit the data to the display the transfer is much faster. Filling up the new buffer now only takes a fraction of the time it takes to transmit one buffer. With 95% SPI bus utilization the display now gets updated in ~16ms.
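
In outline, the double-buffering looks like the sketch below; start_transfer and wait_done are hypothetical stand-ins for whatever DMA primitives the HAL actually provides, and the 4096-byte buffer size is arbitrary:

// Ping-pong buffering: the CPU fills one buffer while the DMA engine drains
// the other, then the roles swap. `start_transfer` and `wait_done` stand in
// for the HAL's real DMA primitives.
fn blit_ping_pong(
    frame: &[u16],
    bufs: &mut [[u8; 4096]; 2],
    mut start_transfer: impl FnMut(&[u8]),
    mut wait_done: impl FnMut(),
) {
    let pixels_per_buf = bufs[0].len() / 2;
    let mut active = 0;
    for chunk in frame.chunks(pixels_per_buf) {
        // Fill the currently idle buffer with big-endian pixel bytes.
        for (i, px) in chunk.iter().enumerate() {
            bufs[active][i * 2..i * 2 + 2].copy_from_slice(&px.to_be_bytes());
        }
        wait_done(); // make sure the previous transfer has finished
        start_transfer(&bufs[active][..chunk.len() * 2]);
        active ^= 1; // next iteration fills the other buffer
    }
    wait_done(); // wait for the last transfer before returning
}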

I've uploaded my custom display-interface in this GIST. The draw_fast method is a bit of a hack, but has the largest impact on performance.

akx commented 1 day ago

@rfuest Nice work! Am I understanding correctly that since the Framebuffer is marked BigEndian, the price for converting from native (little-endian) pixel formats to big-endian is paid when drawing to the buffer instead of when transferring, and since draw_fast just does a "trust me bro, these bytes are what we want to push out to the bus", that works out? 😄

rfuest commented 1 day ago

Yes, that's correct. Some API changes will be necessary to make this safer to use without the chance of accidentally using the wrong endianness.

In my code the CPU is often just waiting for the DMA transfer to finish and converting a little-endian framebuffer into a big-endian format during that time wouldn't negatively impact the time it takes to update the display. But the display-interface crate doesn't provide a way to specify the endianness of a &[u8] buffer. This hack works and is as fast as before, but it is on a whole other level of "trust me bro".

almindor commented 1 day ago

So from what I understand there are a few issues here:

  • display-interface forcing a 64 byte buffer size
  • display-interface forcing endianness runtime conversion
  • display-interface using dynamic dispatch

Optionally I think another issue is that:

  • embedded-hal not providing native iterator SPI write/transfer/read methods

I think if embedded-hal actually had iterator support, and if we figured out a better abstraction for display-interface we should be able to get a fairly quick write using an iterator with 0 additional buffering?

rfuest commented 23 hours ago

  • display-interface forcing a 64 byte buffer size

display-interface doesn't force any buffer size, but the implementation for SPI in display-interface-spi does use a 64 word buffer to convert the iterator based DataFormats into a buffer of bytes that can be passed to the SpiDevice impl.

  • display-interface forcing endianness runtime conversion

No, it doesn't force runtime conversion, but it only supports different endianness for u16 buffers via the DataFormat::{U16BE, U16LE} variants. In my opinion it is inconvenient that you need to cast the more common &[u8] buffer to &[u16] first, which to my knowledge requires an external crate like byte-slice-cast (which the display-interface impls use internally).

  • display-interface using dynamic dispatch

This has been an issue for a long time and there is even an open draft PR from 2021 with a possible solution in the display-interface repo, but nobody seems to have cared about that: https://github.com/therealprof/display-interface/pull/19

I'm not sure if we (or someone) should fix display-interface, which seems to be more or less unmaintained, or if it is better to create a new alternative. In any case this will need some careful design to get this right.

  • embedded-hal not providing native iterator SPI write/transfer/read methods

I think if embedded-hal actually had iterator support, and if we figured out a better abstraction for display-interface we should be able to get a fairly quick write using an iterator with 0 additional buffering?

This might help for some drawing operations, but I'm not sure it is needed. Randomly drawing individual pixels over a relatively slow connection like SPI always comes with a lot of overhead. For images we shouldn't use iterators at all.

I'm planning to add something like draw_image(&mut self, image: &ImageRaw<...>, position: Point) to e-g's DrawTarget trait, which will allow DrawTarget implementations to use a more efficient code path in that case. And because ImageRaw is just a thin wrapper around a byte slice that specifies the image dimensions, color format, and byte order, this could also be used for other applications like writing an entire RAM framebuffer to a display. I'm still not sure about the details, but ideally this addition to e-g and the changes to display-interface would make it possible to use DMA transfers for displaying images in more cases. And this might also be a stepping stone towards async support.
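
Something close to that is already expressible with today's ImageRaw plus the Image adapter, although it still goes through the generic iterator path rather than a dedicated fast path or DMA (a sketch assuming embedded-graphics 0.8, with the byte order spelled out):

use embedded_graphics::{
    image::{Image, ImageRaw},
    pixelcolor::{raw::BigEndian, Rgb565},
    prelude::*,
};

// `frame_bytes` is assumed to hold 320x240 RGB565 pixels, two bytes each,
// already in big-endian byte order.
fn draw_frame<D>(display: &mut D, frame_bytes: &[u8]) -> Result<(), D::Error>
where
    D: DrawTarget<Color = Rgb565>,
{
    let raw: ImageRaw<Rgb565, BigEndian> = ImageRaw::new(frame_bytes, 320);
    Image::new(&raw, Point::zero()).draw(display)
}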

almindor commented 21 hours ago

One thing that mipidsi can do here is to introduce something more than just draw_pixels to the model.

If we allowed something like write_pixels_raw_u8() with &[u8] and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is already in the format the display expects.

That could then be used by the expanded e-g as well as internally where possible. I think all it'd really take is to add a blanket implementation into Model here as a first step on the mipidsi side.
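
As a rough illustration of the shape this could take (the trait name and placement are hypothetical, not the actual Model trait; the only real API used is display-interface's DataFormat::U8 pass-through):

use display_interface::{DataFormat, DisplayError, WriteOnlyDataCommand};

// Hypothetical sketch: a raw byte path alongside the existing pixel-iterator
// path. The default just forwards caller-prepared bytes to the display
// interface via DataFormat::U8, trusting that they are already in the byte
// order the panel expects.
trait WritePixelsRaw {
    fn write_pixels_raw_u8<DI: WriteOnlyDataCommand>(
        &mut self,
        di: &mut DI,
        data: &[u8],
    ) -> Result<(), DisplayError> {
        di.send_data(DataFormat::U8(data))
    }
}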

fu5ha commented 20 hours ago

If you look at my luluu repo, you can see I went through the process of getting within a few percent of the theoretical bandwidth limit of the RP2040 SPI device. There is some interesting stuff in the main firmware, but in particular see the patches in the vendored crates, including the rp-hal implementation and mipidsi.

If we allowed something like write_pixels_raw_u8() with &[u8] and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is already in the format the display expects.

I believe I did something very similar to what you're describing here.

It's not clear to me how much these "fixes" can or should actually be upstreamed but maybe it's useful for reference https://github.com/fu5ha/luluu/tree/main/software

almindor commented 20 hours ago

@rfuest Lmk what you think about https://github.com/almindor/mipidsi/pull/143

@fu5ha Thanks I'll have a look!