akx opened this issue 4 days ago
Hmm, I don't see anything particularly wrong in the code, but AFAICS you should be able to push roughly 60 fps with 80 MHz SPI. Of course in practice that's going to be a bit lower (and I only did my napkin math real quick, I might've made a mistake here). For one full screen of fill you're pushing 320×240×2×8 = 1_228_800 bits, and at 80 MHz that should fit roughly 65 frames a second.
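The napkin math can be written out explicitly (numbers straight from the post: a 320×240 RGB565 frame over an 80 MHz bus):

```rust
// Theoretical frame-rate ceiling for a full-screen RGB565 blit over SPI.
fn main() {
    let bits_per_frame: u64 = 320 * 240 * 2 * 8; // 1_228_800 bits per frame
    let spi_hz: u64 = 80_000_000; // 80 MHz SPI clock
    let max_fps = spi_hz / bits_per_frame; // ≈ 65 frames per second
    println!("bits/frame = {bits_per_frame}, theoretical max = {max_fps} fps");
}
```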
One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case it's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be you're just running on lower clocks on your MCU.
The `display-interface` crate and `embedded-hal` 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new `SpiDevice` trait, `display-interface-spi` buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in `display-interface`, the performance suffers noticeably for fast displays.
> The `display-interface` crate and `embedded-hal` 1.0 add quite a bit of overhead, which slows down the display update rate. To work with the new `SpiDevice` trait, `display-interface-spi` buffers the data in 64 byte chunks before it passes it to the HAL. And to make things worse, the CS pin is toggled for each chunk. Combined with the unnecessary dynamic dispatch in `display-interface`, the performance suffers noticeably for fast displays.
Hmm, that's quite unfortunate. Do you think we should abandon `display-interface` use? I'm also curious why 64 byte chunks are necessary? I don't remember that being any kind of requirement when we discussed the e-h 1.0 SPI traits.
Thank you for the response and thoughts.
> One thing I don't see in your code, though, is setting the MCU's processing speed/clocks. I don't know how the ESP32 inits, but I remember that on my hifive1 I had to set my clock to full. In my case it's 320 MHz, in yours it's 160 MHz from what I see. The default was something quite a bit lower IIRC. It might be you're just running on lower clocks on your MCU.
The full log (which of course I should've posted in the first place, but here it is) says
```
I (220) cpu_start: Pro cpu start user code
I (221) cpu_start: cpu freq: 160000000 Hz
```
so unless something in the ESP-IDF drops that down after the app starts, I think I should be running at 160 MHz.
> I'm also curious why 64 byte chunks are necessary?
That seems to be this bit of code.
Looks like the intent is to read a 64-byte chunk from the pixel iterator, endian-flip it and send it off.
EDIT: Changing that buffer size in a local copy of `display-interface-spi`:
| buffer size | result |
|---|---|
| 64 | FPS: 9.26 / render=120us clone=649us blit=107343us frame=108116us |
| 128 | FPS: 9.90 / render=123us clone=649us blit=100722us frame=101498us |
| 256 | FPS: 10.31 / render=119us clone=649us blit=96933us frame=97705us |
| 512 | FPS: 10.42 / render=122us clone=650us blit=95296us frame=96072us |
| 1024 | FPS: 10.53 / render=123us clone=650us blit=94953us frame=95730us |
| 3840 | Guru Meditation Error: Core 0 panic'ed (Stack protection fault). 😄 |
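For reference, the chunked-write pattern being benchmarked here (drain the pixel iterator into a small fixed buffer, byte-swapping each pixel, and hand every full buffer to the SPI device) can be sketched on the host like this; `send_pixels` and the closure-based `spi_write` are illustrative stand-ins, not the actual `display-interface-spi` code:

```rust
// Sketch of draining a u16 pixel iterator through a fixed 64-byte buffer.
const BUF_SIZE: usize = 64;

fn send_pixels(pixels: impl Iterator<Item = u16>, spi_write: &mut impl FnMut(&[u8])) {
    let mut buf = [0u8; BUF_SIZE];
    let mut i = 0;
    for px in pixels {
        buf[i..i + 2].copy_from_slice(&px.to_be_bytes()); // endian flip per pixel
        i += 2;
        if i == BUF_SIZE {
            spi_write(&buf); // one SpiDevice transaction (and CS toggle) per chunk
            i = 0;
        }
    }
    if i > 0 {
        spi_write(&buf[..i]); // flush the partial tail
    }
}

fn main() {
    let mut chunks = 0usize;
    let mut bytes = 0usize;
    send_pixels((0..320 * 240).map(|_| 0xF800u16), &mut |b| {
        chunks += 1;
        bytes += b.len();
    });
    println!("{chunks} chunks, {bytes} bytes"); // 2400 transactions per 320x240 frame
}
```

With a 64-byte buffer a single 320×240 frame becomes 2400 separate transactions, which is why growing the buffer (fewer transactions, fewer CS toggles) helps, and why the gains taper off once the per-transaction overhead stops dominating.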
Perfect, I was hoping to see how this goes. It seems the flip-flopping of the buffer isn't the main slowdown here. Making the buffer bigger improves things, but not linearly at all, considering the 1024 case should've been 16 times better in theory and is a < 10% improvement.

Could you try changing `fill_contiguous` into a `fill_solid`, or possibly calling `set_pixels` directly? I just want to make sure it's not something funky.
I did some testing and I think the ESP HAL has quite a bit of overhead for the kind of transfers we use. I'm currently testing on an ESP32-C3 with the no_std HAL. With some tweaks I was able to get the time to clear the framebuffer down to ~60ms, which still isn't great.
Using a logic analyzer did reveal some interesting patterns:
For this capture I've used a `display-interface-spi` buffer size of 256 bytes and a SPI clock of 40 MHz (I did set 60 MHz in my program, but I measured 40 MHz). The HAL splits every 256 byte buffer into four 64 byte blocks, because that is the SPI FIFO size (https://docs.esp-rs.org/esp-hal/esp-hal/0.21.1/esp32c3/esp_hal/spi/master/struct.Spi.html#method.write_bytes). While the 64 byte sub-blocks are transferred at full SPI clock speed without any gaps between bytes, the gaps between the 64 byte blocks cause significant overhead. This alone causes the write performance to drop by ~33%.
I'm not familiar enough with the ESP32-C3 internals to know if this dead time could be reduced. The "best" solution would probably be a framebuffer in RAM, which is then transferred via DMA to the display, but that requires a lot of RAM, and there is currently no support for DMA in this crate or the embedded-graphics ecosystem in general.
> Could you try changing `fill_contiguous` into a `fill_solid`, or possibly calling `set_pixels` directly? I just want to make sure it's not something funky.
I changed the draw bit to
```rust
display.fill_solid(&fullscreen, if i % 2 == 0 { Rgb565::BLACK } else { Rgb565::WHITE }).unwrap();
```
```
I (1347) telsu: FPS: 10.10 / render=125us clone=4us blit=99427us frame=99560us
I (1947) telsu: FPS: 10.10 / render=121us clone=4us blit=99424us frame=99552us
I (2547) telsu: FPS: 10.10 / render=125us clone=4us blit=99425us frame=99558us
I (3147) telsu: FPS: 10.10 / render=124us clone=4us blit=99426us frame=99558us
```
which is still kind of unfortunate (and funky, I think?)...
For what it's worth, changing `ST7789::write_pixels` to use `U16LEIter` instead of `U16BEIter` (i.e. so the buffer doesn't need to be byte-swapped, though naturally the data sent to the display is then pretty bogus unless treated elsewhere) changes blit=100382us -> blit=98454us for the original test case.
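The byte swap being measured here is just a per-pixel `to_be_bytes` on the way out; a minimal host-side sketch (illustrative names, not the crate's code):

```rust
// The ST7789 expects big-endian RGB565, while the MCU framebuffer is
// little-endian, so every pixel gets byte-swapped on the way out. Storing
// the framebuffer big-endian up front would let the blit copy bytes verbatim.
fn swap_to_be(src: &[u16], dst: &mut [u8]) {
    for (px, out) in src.iter().zip(dst.chunks_exact_mut(2)) {
        out.copy_from_slice(&px.to_be_bytes());
    }
}

fn main() {
    let frame = [0x07E0u16; 4]; // green in RGB565, little-endian in RAM
    let mut wire = [0u8; 8];
    swap_to_be(&frame, &mut wire);
    println!("{:02X?}", wire); // big-endian byte pairs, as the panel expects
}
```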
I've worked on this a bit more and got the update reasonably fast by using DMA and bypassing the iterator. My implementation uses two DMA buffers, which makes it possible to fill one while the other is being transmitted. There are still gaps between the individual blocks, but the overall impact of this overhead is now fairly low.
This really shows that using a dynamically dispatched iterator to transfer lots of data is a bad idea. Even with two DMA buffers, the time it takes to fill one buffer is much longer than the time it takes to transmit a buffer. This causes the SPI bus to be idle about 80% of the time, which is why it still takes 70ms to update the display at an 80 MHz SPI clock.
But if we bypass the iterator and use `DataFormat::U8` to transmit the data to the display, the transfer is much faster. Filling up the new buffer now only takes a fraction of the time it takes to transmit one buffer. With 95% SPI bus utilization the display now gets updated in ~16ms.
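A rough host-side sketch of the ping-pong scheme described above (the `transmit` closure stands in for starting a DMA transfer; names are illustrative, not the Gist's code):

```rust
// Ping-pong buffering: while one buffer is "in flight" (DMA on real
// hardware), the other is being filled, then the roles swap.
const CHUNK: usize = 4096; // arbitrary DMA buffer size for this sketch

fn blit(frame: &[u8], transmit: &mut impl FnMut(&[u8])) {
    let mut bufs = [[0u8; CHUNK]; 2];
    let mut active = 0;
    for chunk in frame.chunks(CHUNK) {
        bufs[active][..chunk.len()].copy_from_slice(chunk); // fill the idle buffer
        transmit(&bufs[active][..chunk.len()]); // real code: kick off DMA here...
        active ^= 1; // ...and fill the other buffer while it runs
    }
}

fn main() {
    let frame = vec![0u8; 320 * 240 * 2]; // one RGB565 frame
    let mut sent = 0usize;
    blit(&frame, &mut |b| sent += b.len());
    println!("{sent} bytes sent");
}
```

The key point from the comment above: the scheme only pays off if filling a buffer is faster than transmitting one, otherwise the bus idles while the CPU churns through the iterator.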
I've uploaded my custom `display-interface` in this Gist. The `draw_fast` method is a bit of a hack, but has the largest impact on performance.
@rfuest Nice work! Am I understanding correctly that since the `Framebuffer` is marked `BigEndian`, the price for converting from native (little-endian) pixel formats to big-endian is paid when drawing to the buffer instead of when transferring, and since `draw_fast` just does a "trust me bro, these bytes are what we want to push out to the bus", that works out? 😄
Yes, that's correct. Some API changes will be necessary to make this safer to use without the chance of accidentally using the wrong endianness.
In my code the CPU is often just waiting for the DMA transfer to finish, and converting a little-endian framebuffer into a big-endian format during that time wouldn't negatively impact the time it takes to update the display. But the `display-interface` crate doesn't provide a way to specify the endianness of a `&[u8]` buffer. This hack works and is as fast as before, but it is on a whole other level of "trust me bro".
So from what I understand there are a few issues here:

- `display-interface` forcing 64 byte buffer size
- `display-interface` forcing endianness runtime conversion
- `display-interface` using dynamic dispatch

Optionally I think another issue is that:

- `embedded-hal` not providing native iterator SPI `write`/`transfer`/`read` methods

I think if `embedded-hal` actually had iterator support, and if we figured out a better abstraction for `display-interface`, we should be able to get a fairly quick write using an iterator with no additional buffering?
> `display-interface` forcing 64 byte buffer size

`display-interface` doesn't force any buffer size, but the implementation for SPI in `display-interface-spi` does use a 64 word buffer to convert the iterator-based `DataFormat`s into a buffer of bytes that can be passed to the `SpiDevice` impl.
> `display-interface` forcing endianness runtime conversion

No, it doesn't force runtime conversion, but it only supports different endianness with `u16` buffers via the `DataFormat::{U16BE, U16LE}` variants. In my opinion it is inconvenient that you need to cast a more common `&[u8]` buffer to `&[u16]` first, which to my knowledge requires an external crate like `byte-slice-cast` (which the `display-interface` impls use internally).
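Without pulling in `byte-slice-cast`, the same decoding can at least be done portably (if less efficiently than a real alignment-checked cast) with `chunks_exact`; a small host-side sketch:

```rust
// Decode a &[u8] buffer into u16 pixel values without an external crate.
// A zero-copy &[u8] -> &[u16] cast additionally needs alignment and length
// checks, which is what byte-slice-cast provides.
fn as_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect()
}

fn main() {
    let raw = [0xE0u8, 0x07, 0x00, 0xF8]; // two RGB565 pixels, little-endian
    let pixels = as_u16_le(&raw);
    println!("{pixels:04X?}"); // the u16 values reassembled from byte pairs
}
```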
> `display-interface` using dynamic dispatch

This has been an issue for a long time and there is even an open draft PR from 2021 with a possible solution in the `display-interface` repo, but nobody seems to have cared about it: https://github.com/therealprof/display-interface/pull/19

I'm not sure if we (or someone) should fix `display-interface`, which seems to be more or less unmaintained, or if it would be better to create a new alternative. In any case this will need some careful design to get right.
> `embedded-hal` not providing native iterator SPI `write`/`transfer`/`read` methods
>
> I think if `embedded-hal` actually had iterator support, and if we figured out a better abstraction for `display-interface`, we should be able to get a fairly quick write using an iterator with no additional buffering?
This might help for some drawing operations, but I'm not sure it is needed. Randomly drawing individual pixels over a relatively slow connection like SPI always comes with a lot of overhead. For images we shouldn't use iterators at all.
I'm planning to add something like `draw_image(&mut self, image: &ImageRaw<...>, position: Point)` to e-g's `DrawTarget` trait, which will allow `DrawTarget` implementations to use a more efficient code path in that case. And because `ImageRaw` is just a thin wrapper around a byte slice that specifies the image dimensions, color format, and byte order, this could also be used for other applications, like writing an entire RAM framebuffer to a display. I'm still not sure about the details, but ideally this addition to e-g and the changes to `display-interface` would make it possible to use DMA transfers for displaying images in more cases. And this might also be a stepping stone towards async support.
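A hypothetical shape for such a fast path, with a default per-pixel fallback that a driver could override with a single bulk transfer (all names here are illustrative stand-ins, not the actual embedded-graphics API):

```rust
// Illustrative raw-image type: a byte slice plus dimensions, big-endian RGB565.
struct RawImage<'a> {
    data: &'a [u8],
    width: u32,
    height: u32,
}

// Illustrative DrawTarget-like trait: draw_image has a slow default that
// a display driver could override with one DMA/bulk write of image.data.
trait Target {
    fn set_pixel(&mut self, x: u32, y: u32, color: u16);

    fn draw_image(&mut self, image: &RawImage<'_>, x0: u32, y0: u32) {
        for y in 0..image.height {
            for x in 0..image.width {
                let i = ((y * image.width + x) * 2) as usize;
                let px = u16::from_be_bytes([image.data[i], image.data[i + 1]]);
                self.set_pixel(x0 + x, y0 + y, px);
            }
        }
    }
}

// A mock target that just counts pixel writes, to exercise the fallback.
struct CountingTarget(usize);
impl Target for CountingTarget {
    fn set_pixel(&mut self, _x: u32, _y: u32, _color: u16) {
        self.0 += 1;
    }
}

fn main() {
    let data = vec![0u8; 8 * 8 * 2];
    let img = RawImage { data: &data, width: 8, height: 8 };
    let mut t = CountingTarget(0);
    t.draw_image(&img, 0, 0);
    println!("{} pixels drawn via the fallback path", t.0);
}
```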
One thing that `mipidsi` can do here is to introduce something more than just `draw_pixels` to the model.

If we allowed something like `write_pixels_raw_u8()` with `&[u8]` and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is ready for the display directly.

That could then be used by the expanded e-g as well as internally where possible. I think all it'd take really is to add a blanket implementation into `Model` as a first step on the `mipidsi` side.
If you look at my luluu repo, I went through the process of getting within a few percent of the theoretical bandwidth limit of the RP2040's SPI device. There is some interesting stuff in the main firmware, but in particular see the patches in the vendored crates, including the rp-hal implementation and mipidsi.
> If we allowed something like `write_pixels_raw_u8()` with `&[u8]` and exposed or hint-switched it somehow, it'd allow bypassing the 16LE/BE shenanigans if the user knows that the data is ready for the display directly.
I believe I did something very similar to what you're describing here.
It's not clear to me how much these "fixes" can or should actually be upstreamed but maybe it's useful for reference https://github.com/fu5ha/luluu/tree/main/software
@rfuest Lmk what you think about https://github.com/almindor/mipidsi/pull/143
@fu5ha Thanks I'll have a look!
Heya,
I'm wondering how to speed things up for an application that will likely need full-screen updates most of the time.
I have an ESP32-C6-DevKitC-1 and a WaveShare 280x240 1.69" LCD module, using `esp-idf-svc` (at present anyway).

My current experiment code (please excuse the mess, it's an experiment so far) is
and the interesting (performance) output for a `--release` build is:

IOW, 99.2% of the frame time is spent in `fill_contiguous`. Setting the SPI baudrate to something lower than 80 MHz (which, AIUI, is already pushing it, especially given my display is behind 10-centimeter DuPont wires 🤠) doesn't change things a lot; blit time becomes about 123547us.

Is there something obvious I'm doing wrong for an application like this, where I basically just have a buffer of Rgb565 to push to the screen?
And of course thank you for the work you've put into the library and the ecosystem at large! I was surprised to see things working at all (after, of course, having heeded the big red instructions on WaveShare's wiki and powered the display from 3V3 and not 5V...).