Idea: Faster LCD redraw with 16 MHz I2S (instead of 8 MHz SPI)

hubmartin commented 3 years ago

[ ] I searched for similar feature requests and found none that was relevant.

Pitch us your idea!

Faster display redraw

Description

Hi there,

Short story:

It seems it could be possible to use the I2S peripheral as a "faster SPI" running at 16 MHz (double the SPI speed).
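To put "double the SPI speed" into numbers, here is a rough sketch of the raw clocking time for one full frame. It assumes a 240x240 RGB565 framebuffer (the PineTime panel size) and ignores command bytes, gaps between transfers and CPU overhead, so the results are lower bounds, not measured redraw times. `transfer_time_us()` and `FRAME_BYTES` are illustrative names, not part of InfiniTime.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed framebuffer: 240 x 240 pixels, 2 bytes per RGB565 pixel. */
#define FRAME_BYTES (240u * 240u * 2u)

/* Microseconds needed to clock `bytes` out over a single serial data line
 * at `sck_hz`. Ignores command overhead and gaps between transfers. */
static uint64_t transfer_time_us(uint64_t bytes, uint64_t sck_hz) {
    assert(sck_hz > 0);
    return bytes * 8u * 1000000u / sck_hz;
}
```

With the assumed frame this gives 115200 µs at 8 MHz versus 57600 µs at 16 MHz, i.e. roughly 115 ms versus 58 ms of pure clocking per full redraw.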

Long story:

I could be wrong, so I'm looking for some feedback. I'm a low-level embedded developer and I like challenges like driving a lot of smart LEDs with not-so-fast processors. Since the SPI clock divider on the nRF52832 tops out at 8 MHz, I looked around at the other peripherals to see what could be used. PWM and the other peripherals also end at 8 MHz, but I2S looked promising.

InfiniTime drives the ST7789 in SPI mode 3, which means CPHA = 1 and CPOL = 1. It looks like this on a logic analyzer:

[Logic analyzer capture: SPI at 8 MHz, SLPOUT (0x11) command]

Data is sampled on the rising edge. Fortunately, the I2S format is similar: data changes while the clock is low and is stable on the rising edge.
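A small host-side model can make the mode 3 timing concrete. This is only a simulation, not driver code: `mode3_shift_out()` produces half-bit samples of SCK and MOSI for one byte shifted out MSB first with CPOL = 1 and CPHA = 1, and `sample_on_rising_edges()` latches MOSI on every rising edge. Both names are hypothetical. The assert in the receiver checks exactly the property the captures in this thread are about: MOSI must not change across the rising edge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One idle sample plus two half-bit samples per bit. */
#define TRACE_LEN 17

typedef struct {
    uint8_t sck[TRACE_LEN];
    uint8_t mosi[TRACE_LEN];
} mode3_trace;

/* Shift one byte out MSB first in SPI mode 3 (CPOL = 1, CPHA = 1). */
static mode3_trace mode3_shift_out(uint8_t byte) {
    mode3_trace t;
    t.sck[0] = 1; /* clock idles high (CPOL = 1) */
    t.mosi[0] = 1;
    for (int bit = 0; bit < 8; bit++) {
        uint8_t level = (uint8_t)((byte >> (7 - bit)) & 1u);
        t.sck[1 + 2 * bit] = 0;  /* falling edge: transmitter updates MOSI */
        t.mosi[1 + 2 * bit] = level;
        t.sck[2 + 2 * bit] = 1;  /* rising edge: MOSI held, receiver samples */
        t.mosi[2 + 2 * bit] = level;
    }
    return t;
}

/* Receiver: latch MOSI on every low-to-high SCK transition. */
static uint8_t sample_on_rising_edges(const mode3_trace *t) {
    uint8_t byte = 0;
    for (size_t i = 1; i < TRACE_LEN; i++) {
        if (t->sck[i - 1] == 0 && t->sck[i] == 1) {
            assert(t->mosi[i] == t->mosi[i - 1]); /* data stable across the edge */
            byte = (uint8_t)((byte << 1) | t->mosi[i]);
        }
    }
    return byte;
}
```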

Using the I2S example from the NRF52 SDK (\examples\peripheral\i2s), I was able to generate the same waveform, with data stable on the rising clock edge.

Here it is on the scope:

The signal above is 4 MHz. My Saleae-clone analyzer samples at 24 MS/s and aliases a 16 MHz signal, so I connected my scope to look at the 16 MHz signal. Don't be scared by the signal integrity: there is no proper grounding. Also, I debugged this on hardware that has some capacitive load on those specific pins of the PCB.

If I haven't overlooked something, this might be usable.

What needs to be done:

Investigate whether the signal is usable at those speeds (the nRF can have some limits, but Nordic states that an I2S clock of up to 16 MHz is supported). At 8 MHz the data is stable on the rising edge. At 10 and 16 MHz the data does not look stable on the rising edge. At 10 MHz the clock spends more time high than low, so it seems the I2S peripheral is doing some clock magic that could be a show-stopper at 10 and 16 MHz. (More screenshots below in the Update.)
Add pinmux switching to the SPI driver that chooses, based on the chip_select pin, SPI for the SPI flash (needs MOSI + MISO) or I2S (only "MOSI") for the LCD.
Check whether data can be clocked out properly from the first byte. The nRF example I tested sends some zero bytes at the beginning, but I didn't have time to investigate whether that is a code or a peripheral issue (image below).
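The chip_select-based switching described above could look roughly like this. It is only the decision logic, not the pinmux itself: the `bus_kind` enum, `select_bus()` and the pin numbers are hypothetical (check the pins against the board schematic), and reassigning the peripherals' PSEL registers is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { BUS_SPIM, BUS_I2S } bus_kind;

/* Hypothetical chip-select pin numbers. */
enum { PIN_CS_FLASH = 5, PIN_CS_LCD = 25 };

/* The flash needs MISO, so it always goes over SPIM. The LCD is only written
 * to, so it can use the I2S SDOUT pin as "MOSI"; anything that reads falls
 * back to SPIM. */
static bus_kind select_bus(uint8_t cs_pin, bool needs_read) {
    assert(cs_pin == PIN_CS_FLASH || cs_pin == PIN_CS_LCD);
    if (cs_pin == PIN_CS_LCD && !needs_read) {
        return BUS_I2S;
    }
    return BUS_SPIM;
}
```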

Update:

I repeated the measurement with only the scope connected to those signals, at 8, 10 and 16 MHz. It looks like at 10 and 16 MHz the clock really is out of phase, and the data also changes on the rising clock edge. 8 MHz seems fine: the data is stable on the rising clock.

At 10 MHz and above, it seems to me that the data is not stable on the rising clock. :(

If anyone is interested, I can zip and share my edited I2S example project. I made some ugly changes and I'm not sure exactly what I changed from the default code/SDK configuration.

Unfortunately, I don't have the time or a development kit to continue, so my journey ends here. It would be great if other developers reviewed this idea and someone tested it and created a basic working proof of concept.

TT-392 commented 3 years ago

I did some testing with my scope and saw the same phase-shift problem, but it turns out that was my scope loading the signal down too much. After switching to 10x mode, the signal actually looks fine.

hubmartin commented 3 years ago

Thanks @TT-392 for testing. I also have 10x probes, but the device the nRF52 is soldered to probably has some traces that weren't designed for 16 MHz. So this might actually be usable after all! Do you have further plans to create a proof of concept or improve the low-level driver? Thanks

TT-392 commented 3 years ago

I am currently playing around with it in my own display driver (I have my own OS). Once I'm done experimenting with it there, I plan to look into implementing it in InfiniTime.

JF002 commented 3 years ago

That's a great idea! @hubmartin @TT-392 keep us up to date with your findings!!

TT-392 commented 3 years ago

It looks like the nRF52 I2S runs the clock for about 8 bytes before actually sending any data, and PPI + EVENTS_RXPTRUPD doesn't seem fast enough to drive something like a reliable chip select that makes the display ignore that data. Some magic with a timer could probably work, but it is probably easier to just add 8 bytes at the end to overwrite the first 4 pixels. Also, unlike SPIM, I2S keeps looping and resending the buffer, so you have to stop it at some point, and again PPI + EVENTS_RXPTRUPD is not fast enough to cut it off at a byte boundary. So hopefully, if you send the display 6 extra bits at the end of the transmission, it just ignores them.

hubmartin commented 3 years ago

@TT-392 What about changing the I2S clock divider on the fly? You could start at something like 4-8 MHz for the first few bytes, assert CS at the right moment with PPI / a timer, and then ramp up the speed. Something similar could be done at the end. I'm aware this is hacky, but let's see what the hardware thinks :)

Another solution for stopping the transfer: at the end, the write window in the LCD wraps around, so in theory you can append data from the beginning of the framebuffer to the buffer again and stop the transfer at any time. Some pixels from the beginning will be redrawn a second time, but that wouldn't be visible. And you don't care whether you stop the transfer on a byte boundary or at, say, the 5th bit of the 2nd byte. However, I'm not sure what LVGL drawing looks like and whether you have the whole buffer at hand when drawing a bigger area.
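Combining this wrap-around idea with the leading-junk observation above, a tx buffer can be built so that every byte lands at its own position even though the first `lost` bytes on the wire are junk. This is a host-side sketch under explicit assumptions: the junk is a whole number of bytes, the ST7789 write window wraps back to its start exactly as described, and `simulate_display()` is a hypothetical model of the panel, not real hardware behavior. `build_wrapped_tx()` is likewise a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `out` so that, after `lost` junk bytes emitted first by the
 * peripheral, wire byte (lost + i) carries the frame byte that belongs at
 * window position (lost + i) % frame_len. Needs out_len >= frame_len so
 * the positions hit by the junk get rewritten after the wrap. */
static void build_wrapped_tx(const uint8_t *frame, size_t frame_len, size_t lost,
                             uint8_t *out, size_t out_len) {
    assert(frame_len > 0 && out_len >= frame_len);
    for (size_t i = 0; i < out_len; i++) {
        out[i] = frame[(lost + i) % frame_len];
    }
}

/* Hypothetical panel model: wire byte j is stored at window position
 * j % frame_len, i.e. the write window wraps around. */
static void simulate_display(const uint8_t *wire, size_t wire_len,
                             uint8_t *gram, size_t frame_len) {
    for (size_t j = 0; j < wire_len; j++) {
        gram[j % frame_len] = wire[j];
    }
}
```

The tail can be longer than the frame: any extra bytes just rewrite pixels with the data they already hold, so stopping the transfer late (even mid-byte) is harmless under these assumptions.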

TT-392 commented 3 years ago

I tried changing the frequency on the fly. It seemed like a nice idea, but it turned out that EVENTS_RXPTRUPD is just generated at an awkward moment, no matter the frequency. I have been playing around with the display a bit, and it doesn't seem to mind some extra bits at the end of the transmission; you just have to toggle chip select to reset where the next byte starts. Also, as far as I can tell, regardless of CONFIG.FORMAT, the clock toggles 65 times before the actual transmission. This is a problem for the display, but my solution of bit-banging 7 bits before the transmission seems to work. As for the first bytes being empty, overwriting them at the end of the bitmap could be an option.

I might also experiment with toggling the cmd pin halfway through the transmission. That way the bitmap sent to the display starts where it is supposed to start, and you get a small speed boost at the start of the transmission. I'm not entirely sure yet how an implementation of that would work; there's a good chance it would mean 3 NOP command bytes at the start of each command, which I'm pretty sure is still a lot faster than the old solution.

TT-392 commented 3 years ago


OK, so here is the scope image of me writing a 2x2 square of white pixels to the display. The nice thing here is that, because of the way I2S works (looping DMA, with the ability to change the data pointer without ever stopping the peripheral), we get handy events (at most once every 4 bytes) that we can use to toggle the cmd pin. By using this to send the commands and command parameters too, we get a faster driver that also gets rid of the leading-zeros problem.

The 7 bits at the start are there because the number of leading zeros (65) is not divisible by 8. The NOPs are there because the command pin can be switched at most once every 4 bytes.
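The padding arithmetic is small enough to spell out. The helpers below are hypothetical names; they compute how many extra clocks to bit-bang so that the peripheral's leading clocks (65 as measured above) add up to whole bytes on the display side: 65 + 7 = 72 clocks, i.e. 9 junk bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Extra SCK pulses to bit-bang before starting I2S so that the display sees
 * the peripheral's leading clocks as a whole number of bytes. */
static uint32_t pad_bits_to_byte(uint32_t leading_clocks) {
    return (8u - leading_clocks % 8u) % 8u;
}

/* Whole junk bytes the display receives before the real data starts. */
static uint32_t junk_bytes(uint32_t leading_clocks) {
    uint32_t total = leading_clocks + pad_bits_to_byte(leading_clocks);
    assert(total % 8u == 0);
    return total / 8u;
}
```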

The cmd pin is switched by a PPI channel that connects EVENTS_RXPTRUPD to a GPIOTE task driving the cmd pin. EVENTS_RXPTRUPD fires a few bits before the end of the last of the 4 bytes, and because the command pin is sampled at the end of each byte, the last byte of each data packet becomes a cmd byte, and the last byte of each cmd packet becomes a data byte.
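To make the one-byte shift concrete, here is a small model of my reading of that mechanism, not a measurement. Each 4-byte packet gets the D/C level the PPI/GPIOTE toggle sets for it, and because the toggle lands before the last byte of the previous packet is latched, that byte takes the new level too. `effective_dc()` is a hypothetical helper, and the model ignores any extra pipeline delay inside the peripheral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per EVENTS_RXPTRUPD with SWIDTH = 16 and MAXCNT = 1. */
#define PACKET_BYTES 4u

/* pkt_dc[k]: D/C level requested for packet k (0 = command, 1 = data).
 * byte_dc must hold n_pkts * PACKET_BYTES entries and receives the level
 * the display actually sees for each byte. */
static void effective_dc(const uint8_t *pkt_dc, size_t n_pkts, uint8_t *byte_dc) {
    for (size_t k = 0; k < n_pkts; k++) {
        assert(pkt_dc[k] <= 1);
        for (size_t b = 0; b < PACKET_BYTES; b++) {
            byte_dc[k * PACKET_BYTES + b] = pkt_dc[k];
        }
    }
    /* The toggle for packet k already applies to the last byte of packet k-1. */
    for (size_t k = 1; k < n_pkts; k++) {
        byte_dc[k * PACKET_BYTES - 1] = pkt_dc[k];
    }
}
```

With packets {data, cmd, data}, bytes 3 to 6 end up in command mode: the last byte of the first packet plus the first three bytes of the cmd packet, while byte 7 is already data. Putting NOPs in the first three of those bytes and the real command in the last would fit the NOP padding mentioned earlier in the thread.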

As far as I can tell, the I2S peripheral can only run at 16 MHz if SWIDTH >= 16. This means the transfers are in 2-byte words, which results in 4 bytes per RXTXD.MAXCNT (MAXCNT counts 32-bit words). These words are little-endian, so every byte pair within the transfer has to be swapped.
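The SWIDTH >= 16 observation matches the master clock chain as I read the nRF52832 Product Specification: LRCK = MCK / RATIO and SCK = 2 * LRCK * SWIDTH. The highest MCK setting is 32 MHz / 2 = 16 MHz and the smallest RATIO is 32X, so SCK = MCK requires RATIO = 2 * SWIDTH: 32X for 16-bit or 48X for 24-bit samples. 8-bit samples would need 16X, which does not exist, so they top out at 8 MHz. The helper below (hypothetical name) just evaluates that formula; please double-check it against the spec before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* SCK of the nRF52 I2S in master mode, per my reading of the spec:
 * LRCK = MCK / RATIO, SCK = 2 * LRCK * SWIDTH.
 * Computed as 2 * SWIDTH * MCK / RATIO in 64 bits to avoid rounding LRCK. */
static uint32_t i2s_sck_hz(uint32_t mck_hz, uint32_t ratio, uint32_t swidth_bits) {
    assert(ratio > 0);
    assert(swidth_bits == 8 || swidth_bits == 16 || swidth_bits == 24);
    return (uint32_t)((uint64_t)mck_hz * 2u * swidth_bits / ratio);
}
```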

hubmartin commented 3 years ago

@TT-392 Wow, now I understand the 7 bit-banged bits and the CMD signal switched by PPI. The way you solved the 65 leading bits not being divisible by 8 is really clever. :+1: How big a problem is it to swap every 2 bytes with LVGL? I'm not familiar with LVGL, but AFAIK I saw the color format somewhere in its configuration, so maybe the swap could be done internally by LVGL. Maybe this is it:

/* Swap the 2 bytes of RGB565 color.
 * Useful if the display has a 8 bit interface (e.g. SPI)*/
#define LV_COLOR_16_SWAP   1

Thanks for your effort!

TT-392 commented 3 years ago

So, like, nice idea, but I don't think that's gonna work. The cmd pin is toggled before the end of each transfer, and because this pin is sampled at the end of each byte, the actual data is offset by one byte, so each 16-bit word contains two half pixels. So I guess we just have to hope that swapping the bytes before / during the transfer doesn't take too much time.
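Swapping the byte pairs of the outgoing stream itself, independent of pixel boundaries, is simple enough. A minimal sketch, assuming (as described above) that with SWIDTH = 16 the two bytes of each 16-bit word leave in swapped order relative to memory, and that the swap runs over the final byte stream (including the one-byte offset) just before it is handed to EasyDMA. `swap_byte_pairs()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap bytes 2n and 2n+1 of the outgoing buffer so that, after the
 * peripheral's little-endian word handling, the wire order matches the
 * order the display expects. Works on the raw stream, not on pixels. */
static void swap_byte_pairs(uint8_t *buf, size_t len) {
    assert(len % 2 == 0);
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

If this loop turns out to be too slow, the CMSIS `__REV16` intrinsic on the Cortex-M4 swaps the bytes within both halfwords of a 32-bit word in one instruction; that optimization is not shown here.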

TT-392 commented 2 years ago

Ok, so, a little update.

First, a small, annoying property of I2S on the nRF52 that I haven't mentioned yet: the data on the pins seems to be 8 bytes behind the CPU (it looks like it goes through something like a FIFO buffer that causes the delay). In practice, this means that when you toggle the command pin, you actually toggle it for the data that is 8 bytes in the past.

Second, I have been working on a prototype driver that tries to implement what's in the picture I posted earlier in a more usable way. I finally got it working, but it turns out that if the code just happens to compile slightly differently, the interrupt routine ends up being too slow (and I don't think there is much left to optimize there, except maybe by going to inline assembly). So I'm gonna need a different approach. I have thought about using LRCK as the command pin and disabling it when transitioning to the color data, but sadly the display controller doesn't like something like: CASET, X1 >> 8, X1 & 0xff, NOP, NOP, X2 >> 8, X2 & 0xff. I think I'll just stick with the approach I had, but increase the minimum transfer size (and thus the minimum time between CMD pin flips).
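Some quick arithmetic shows why a per-packet interrupt is so tight. The sketch computes the CPU cycles between two consecutive EVENTS_RXPTRUPD events, assuming one event per buffer and the nRF52832's 64 MHz CPU clock. With 4-byte buffers at 16 MHz there are only 128 cycles for interrupt entry, the handler and exit; a larger minimum transfer scales that budget linearly. `cycles_per_buffer()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles between two buffer-pointer events when each buffer holds
 * `bytes`, SCK runs at `sck_hz` and the core at `cpu_hz`. */
static uint32_t cycles_per_buffer(uint32_t bytes, uint32_t sck_hz, uint32_t cpu_hz) {
    assert(sck_hz > 0);
    return (uint32_t)((uint64_t)bytes * 8u * cpu_hz / sck_hz);
}
```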

Riksu9000 commented 2 years ago

Any updates on this?

TT-392 commented 2 years ago

Since I posted my last message, I have started my internship, which is full-time, so I haven't really had time, and I also had COVID until last week and I still feel a bit shit from that. I have kinda given up on writing an InfiniTime driver, because I have like no C++ experience and I had quite a hard time figuring out how to do that properly. I still want to write a simple example driver though, since I have basically figured out how to do a bunch of this stuff at this point. And I kinda hope I find the time and energy to get that done in the next week or so.

TT-392 commented 2 years ago

Well, it took some time, but here is the example driver. Feel free to ask questions about the code.

https://github.com/TT-392/pinetime-I2S-display-driver-example

TT-392 commented 2 years ago

Oh, as a bit of extra explanation: I ended up simplifying things by writing just a RAMWR + block-of-pixels function for I2S; the rest of the commands just go over SPIM.

JF002 commented 2 years ago

Awesome! :1st_place_medal: I'm really curious to see how well it'll run on InfiniTime! I'll definitely test it as soon as I get some time!

TT-392 commented 2 years ago

Just keep in mind that it will need some work to run on InfiniTime, but for someone familiar with the source code I don't think that part will be too hard (I mainly didn't have the energy to do it because C++ and the RTOS are a bit overwhelming for me).

joaquimorg commented 2 years ago

Thanks @TT-392! I made a little test using LVGL to compare SPI and I2S:

https://user-images.githubusercontent.com/1682318/158653154-eb9c343e-dd4b-4a03-98df-182240570659.mov

tigercoding56 commented 1 year ago

Wow
