david-res / ILI948x_t4_mm

A basic display driver for ILI948X series on a Teensy Micromod
5 stars 3 forks source link

strange bug.... #1

Open vindar opened 1 year ago

vindar commented 1 year ago

Hi David,

While trying your library, I am encountering a very strange bug... I have been pulling my hair for the last couple of hours and yet I cannot figure out what the problem is...

Here is the simplest code I could make that exhibits the bug.

#include <ILI948x_t4_mm.h>

#define LCD_DC 13 
#define LCD_CS  11
#define LCD_RST 12
#define LCD_TE 2
#define LCD_RD 3

#define LX 320
#define LY 240

ILI948x_t4_mm lcd = ILI948x_t4_mm(LCD_DC, LCD_CS, LCD_RST);

// 3 framebuffers with size 320x240
uint16_t fb1[LX * LY];
uint16_t fb2[LX * LY];
DMAMEM uint16_t fb3[LX * LY];

volatile int dma_part = 0; // DMA flag: 0 = idle, 1 = uploading upper half, 2 = uploading lower half. 

/** dma callback */
void cbDMA()
    {
    if (dma_part == 1)
        { // finished uploading the upper half. now we upload the lower half [0,320]x[240,479]
        dma_part = 2;
        lcd.pushPixels16bitDMA(fb2, 0, LY, LX - 1, 2 * LY - 1);
        } 
    else
        { // we are done
        dma_part = 0;
        }
    }

/** upload to the screen: we first upload thez upper half and then the lower half when DMA callback is called */
void draw()
    {
    while (dma_part != 0) { yield(); } // wait for previous transfer to complete 
    dma_part = 1;
    lcd.pushPixels16bitDMA(fb1, 0, 0, LX - 1, LY - 1); // upload the upper half [0,320]x[0,239]
    }

/** this method returns only when it is safe to draw onto im1 (i.e. upload of upper part is finished) */
void waitPart1()
    {
    while (dma_part == 1) { yield(); }
    }

/** this method returns only when it is safe to draw onto im1 (i.e. upload is finished) */
void waitPart2()
    {
    while (dma_part != 0) { yield(); }
    }

void setup()
    {
    Serial.begin(115200);
    Serial.print(CrashReport);

    pinMode(LCD_RST, OUTPUT);
    digitalWrite(LCD_RST, HIGH);
    pinMode(LCD_TE, INPUT);
    pinMode(LCD_RD, OUTPUT);
    digitalWrite(LCD_RD, HIGH);

    lcd.begin(40);
    lcd.setRotation(0);
    lcd.setFrameRate(78);
    lcd.onCompleteCB(cbDMA);

    }

/** perform some 'stupid busy work' on a given framebuffer */
void stupidBusyWork(uint16_t * p)
    {
    elapsedMicros mm = 0;
    const int l = random(10, 8000);
    uint16_t ii = 0;    
    while (mm < l)
        {
        for (int i = 0; i < 10000; i++) { p[random(LX) + LX * random(LY)] = ii++; } 
        }
    fb3[0] = 10; 
    fb3[1] = 11;
    fb3[2] = 12;
    fb3[3] = 13;
    fb3[4] = 14;
    fb3[5] = 15;
    fb3[6] = 16;
    fb3[7] = 17;
    fb3[8] = 18;
    fb3[9] = 19;
    fb3[10] = 20;    
    for (int i = 0; i < LX * LY; i++) fb3[i] = 0;
    }

void loop()
    {
    Serial.print(".");

    waitPart1();
    stupidBusyWork(fb1);

    waitPart2();
    stupidBusyWork(fb2);

    draw();
    }

When I run this code, its works for a little time (usually less than 1 second) and then hangs: fixed screen, no more serial output of "."...

This code simply draws random noise on the screen. The upload is performed in two parts (first upper half fb1 and then lower half fb2) chained via the DMA callback. There is a third framebuffer fb3 which serves no purpose (but removing it removes the bug).

In order to see the bug, the code must be compiled with Teensyduino 1.57 (i.e. the latest stable version) and with "faster" for the optimization option. With another optimization (fast, fastest...) the bug may not be there or take much longer to appear.

Also changing a single line of this code can makes the bug disappear. This indicates that the way the code is laid out in memory matters and this hints very strongly at a memory access violation somewhere... Yet, I have checked this code over and over and I cannot see any problem with it. It does not make any out of bound access... So I am thinking that maybe the problem is with your library (or the flexio library that you use) ?

I know this bug looks artificial like that but the same problem occurs when I try to use the TGX graphic library combined with your library...

david-res commented 1 year ago

Hi Arvind,

Does the bug manifest when doing non DMA transfers?

Can you check if it’s hanging here Or in any of these while loops scattered throughout the library? while(WR_DMATransferDone == false)

I also use some NOP commands in a loop to delay the D/C pin toggle when writing, as without it, it sometimes toggles to the opposite state before the last byte is transferred.

I would suspect either of these three being the root cause.

vindar commented 1 year ago

Hi David

Hi Arvind,

Does the bug manifest when doing non DMA transfers?

The bug is really about chaining DMA transfers (calling pushPixels16bitDMA() from inside the DMA callback).

Can you check if it’s hanging here Or in any of these while loops scattered throughout the library? while(WR_DMATransferDone == false)

It is more problematic than that, because the Teensy does not just hang (that would be easy to debug :)). Sometimes (but not always) the Teensy completely crashes with an error like that:

A problem occurred at(system time) 15:4 : 15
Code was executing from address 0x2270
CFSR : 82
(DACCVIOL)Data Access Violation
(MMARVALID) Accessed Address : 0x1F687140
Temperature inside the chip was 58.51 °C
Startup CPU clock speed is 600MHz
Reboot was caused by auto reboot after fault or bad interrupt detected

So it really looks like memory becomes corrupted and the code jumps to arbitrary code segments, sometimes this just make the program 'hang' but sometimes it completely crashes the Teensy... I tried to check for buffer overflows where code instructions or the stack could be overwritten and become corrupted but I did not find anything (but since I do not know how flexio works my analysis is only superficial).

I also use some NOP commands in a loop to delay the D/C pin toggle when writing, as without it, it sometimes toggles to the opposite state before the last byte is transferred.

Yep , saw that but I do not think this is the problem. I increased the delay and the bug remained...

I would suspect either of these three being the root cause.

I think there is a deeper problem here but it is very difficult to debug because just placing a Serial.print() somewhere changes the behavior of the code. It is really a shame the Teensy does not have an (easily accessible) hardware debugger available. Here, it would prove really useful !

Also, looking at your code I have found other issues (but none that solves the bug unfortunately):,

  1. You should remove the variables defined at line 614-617 of ILI948x_t4_mm.cpp:

    bool WR_DMATransferDone = true;
    uint32_t MulBeatCountRemain;
    //uint16_t *MulBeatDataRemain;
    uint32_t TotalSize; 

    There look like left-over from when you used global variables instead of class member variables... But right now, these variables are never used since they are shadowed by the class member variables with the same name...

  2. Also, about these (member) variables, and in particular WR_DMATransferDone, you MUST add the volatile keyword to their definition because otherwise the compiler is allowed to replace statements like:

    while(WR_DMATransferDone == false)
    {
    }

    by a single test because it is can assume that the value WR_DMATransferDone will not change inside the loop. Adding the volatile keyword tells the compiler that this variable can change at any time and so it cannot optimize away the while loop... In fact, any variable that are used in both interrupt handler and outside code (or which are assessed by other peripheral) must be marked volatile !

I will keep trying to find the bug a little more but I am running out of idea and I do not know enough (and do not have the courage to read up) about flexio so I do not think I will be successful..

david-res commented 1 year ago

So I've made a few changes.

The issue you are facing - I wish I could help out. Perhaps it's down to timing? As when the dma callback triggers, FlexIO is still busy pushing out the last pulse of data from the buffer.

In the crashReport it states code was executing from address 0x2270. Have you traced back to it using addr2line? What was it?

vindar commented 1 year ago

Hi,

Good news: the code that I posted above now work with the latest version of your lib :) Apparently your changes and adding the volatile did the trick (at least for the initial bug described in the opening post). The other problems I was experiencing and referring to in my second post were probably my own fault because I cannot reproduce them anymore !

By the way, with the new version of the library, in order to use my ILI9844 screen, I still had to #define R61529 (and not #define ILI9844) at the top of ILI948x_t4_mm.h... Shouldn't you swap the defines names ? Another suggestion: at the top of the header file, you may consider enumerating all the supported screen types. Maybe something like:

// uncomment below the line corresponding to your screen:

//#define ILI9481_1
//#define ILI9481_2
//#define ILI9486
#define ILI9488
//#define R61529

This way, it will be easy for the user to setup the library for a given screen by simply commenting /uncommenting a single line at the top of the main header file.

Finally, I also had to set the correct #define MADCTL_ARRAY for my ILI9488... Maybe you could automate this so that the user only needed to define the screen type and nothing else. Maybe something like:

#if defined (ILI9488) || defined (ILI9486)
#define MADCTL_ARRAY { MADCTL_MX | MADCTL_BGR, MADCTL_MV | MADCTL_BGR, MADCTL_MY | MADCTL_BGR, MADCTL_MX | MADCTL_MY | MADCTL_MV | MADCTL_BGR, MADCTL_MY | MADCTL_MV | MADCTL_BGR } // ILI9488/9486
#elif defined (ILI9481_1) || defined (ILI9481_2)
#define MADCTL_ARRAY { MADCTL_BGR | MADCTL_SS, MADCTL_MV | MADCTL_BGR, MADCTL_BGR | MADCTL_GS, MADCTL_MV | MADCTL_BGR | MADCTL_SS | MADCTL_GS } // ILI9481
#elif defined (R61529)
#define MADCTL_ARRAY { MADCTL_RGB, MADCTL_MV | MADCTL_MX | MADCTL_RGB, MADCTL_RGB | MADCTL_GS | MADCTL_MX, MADCTL_MV | MADCTL_RGB | MADCTL_GS } // R61529
#endif

Otherwise, I just tested my tgx library with your driver and it works very well. Attached in the .zip below is an example displaying a 3D mesh. Even without using the TE signal, the result is really good and tearing is not noticeable.

cyborg_ILI9488.zip

cyborgILI9488

david-res commented 1 year ago

Hi Arvind,

This is great news! Glad that it has worked out :) I'm not sure why the R615 code is needed for your ILI display, as I had used the ILI9488 code in the past and it had worked. I need to rework it all, will get it sorted this week.

I had also set the MADCTL as you suggested in a local build but never synced it.. So I just threw in what you had suggested for now.

Is there any benefit using your GFX library between the display driver & LVGL?

vindar commented 1 year ago

Hi David,

Hi Arvind,

This is great news! Glad that it has worked out :)

Yep, I just let the demo run for the whole night without trouble ! Seems like the driver quite stable :)

I'm not sure why the R615 code is needed for your ILI display, as I had used the ILI9488 code in the past and it had worked. I need to rework it all, will get it sorted this week.

It is weird indeed. The screen I have is this one from buydisplay with 8 bits parallel interface, 5V, no touchscreen, no font chip... I tried using the other defines (ILI9481_1, ILI9486, ILI9488) but only R61529 works correctly. With any other initialization code the screen prints garbage... For the MADCTL, I need to use the one for ILI9488 (the first one) otherwise colors are wrong.

By the way, there is a typo line 499 of ILI948x_t4_mm.cpp (an unwanted closing parenthesis )) that you should remove because it create a compile error when using #define ILI9844_2...

I had also set the MADCTL as you suggested in a local build but never synced it.. So I just threw in what you had suggested for now.

Is there any benefit using your GFX library between the display driver & LVGL?

It depends :) The TGX library is a general purpose graphic library for 2D and 3D drawing whereas LVGL aims at drawing (and managing) a user interface. TGX does not have special methods for drawing things like buttons or other widgets but it has advanced capabilities for drawing primitives (circles, lines, Bezier splines...) and also image manipulation (resizing, rotation, color conversion...). It is fast and and create high quality rendering (anti-aliasing, bilinear filtering...). The library is completely self-contained and does nothing 'under the hood'. In fact, all the library does is to draw into a memory framebuffer (encapsulated in an Image object) but it has no idea of the underlying hardware.

This means that you can always #include <tgx.h> in a project without any consequence. Only the subset of methods you use will actually be linked to your code and nothing ever happens in the background (in particular there is no dynamic memory allocation)... On the top of my head, I can see a few use cases where the library might be usefully combined with LVGL:

  1. To perform post-processing on the memory framebuffer created by lvgl before sending it to the screen (such as creating overlay, resizing...).
  2. Using TGX to draw 2D and 3D graphics into an image/framebuffer with can subsequently be included into an LVGL widget. For instance, this could be used to embed fast 2D/3D graphics into a windows inside a LVGL app... This might be particularly useful for a game or for displaying animations.

Now, I am going play a little bit more with LVGL and see how it looks in landscape mode. But I do not have a touchscreen on my display and the LVGL music demo does not show much tearing. I'll try to see if I can create an example that clearly shows screen tearing... May not have much time in the coming days but I will eventually experiment with this at some point !

david-res commented 1 year ago

I fixed the issue in line 499. But as I said, I will redo the whole thing so that there are no errors and it works as expected. I tested your 3D rendering code and it works very well - super impressive!!

To try see tearing in LVGL, perhaps try the demo widgets in slideshow mode? It should navigate between tabs and scroll the pages. I remember a couple years ago that I saw tearing even on 16 bit parallel bus (in polling mode on a T4.1)

Did you see my idea about using the PXP? I think what might be interesting to see, is to feed the LVGL FB in landscape orientation into the PXP and then rotate the image 90 degrees to feed it into the display in landscape. Do you think this could be a feasible solution, or is it too much of a hassle to actually try it out? EDIT: Might not have enough space in DMAMEM to fit enough buffers for LVGL & PXP.

vindar commented 1 year ago

I fixed the issue in line 499. But as I said, I will redo the whole thing so that there are no errors and it works as expected. I tested your 3D rendering code and it works very well - super impressive!!

Thanks :)

To try see tearing in LVGL, perhaps try the demo widgets in slideshow mode? It should navigate between tabs and scroll the pages. I remember a couple years ago that I saw tearing even on 16 bit parallel bus (in polling mode on a T4.1)

Ok, I will try it.

Did you see my idea about using the PXP? I think what might be interesting to see, is to feed the LVGL FB in landscape orientation into the PXP and then rotate the image 90 degrees to feed it into the display in landscape. Do you think this could be a feasible solution, or is it too much of a hassle to actually try it out? EDIT: Might not have enough space in DMAMEM to fit enough buffers for LVGL & PXP.

Yes, using the PXP seems a good idea. However, from what I recall, the PXP is not really faster for a simple rotation than doing it directly on the CPU (but it does free up CPU time for other computation meanwhile). So I think it may be simpler to experiment first with direct 'CPU' rotation and then add the PXP to the mix later... Rotating a 320x480 buffer by 90 degrees should take about 1 or 2ms so it will be fast enough. The real problem is that, for most programs, there is not enough RAM left available to afford 2 full-size framebuffers... And doing "in place" rotation of a non-square matrix is a really nasty (and time consuming) business and, unfortunately, I do not think that the PXP can do that :( So I want try some alternatives first.

vindar commented 1 year ago

Ok, I came up with a simple simple LVGL example that shows up tearing very clearly in landscape mode (source code attached below). If you change the value of setTearingScanLine() you can really see the tear line starting at the chosen position.

So let's now try to improve this... :)

lvgl_teared.zip

david-res commented 1 year ago

Loaded it onto my display - can clearly see it! So what ideas do you have in mind to try mitigate the tearing in landscape? It it worth trying LVGLs software rotation first?

vindar commented 1 year ago

Loaded it onto my display - can clearly see it! So what ideas do you have in mind to try mitigate the tearing in landscape? It it worth trying LVGLs software rotation first?

The problem with using LVGL software rotation is that, in order to prevent screen tearing, we must send full screen updates in sync with the TE signal. But LVGL cannot perform "in place" rotation of a buffer: I tried to force this behaviour in LVGL by setting the full_redraw and rotation flag in the driver and it failed miserably. This is not surprising because doing in place non-square rotation is very difficult and slow. But this means that, in order to use LVGL software rotation, we would need to provide LVGL with 2 full size framebuffers and that would cost 600K of RAM...

Here is another approach which I just tested and which seems to work fine. It only cost about 360K or RAM (1 full 'internal' framebuffer and a partial 60K framebuffer for LVGL).

The idea is to put the display in orientation 0 (portrait) and LVGL in in landscape mode (without software rotation). We only give LVGL the small (60K) partial framebuffer to draw onto. When LVGL send the small buffer to be drawn, we rotate it and copy it at its correct position in the full internal framebuffer. Then, when LVGL is done with all the partial redraw, we send the internal framebuffer in sync with the TE signal...

Here below is a code that illustrate the idea, as you can see tearing has disappeared in this example. Of course, it should be tried with a more complicated demo to see if it still performs correctly...

In order to test this code, you will need to use the 'improved-drawing-primitive' branch of my TGX library otherwise it will not compile.

You can experiment changing the size of the partial framebuffer. In my (quick) tests, even a 30K framebuffer works fine..

lvgl_tearfree.zip

david-res commented 1 year ago

Ill give the test app a try, and also try implementing it into my current app and see how it performs.

Im making a hybrid T4.1/Micromod test board which will add all of the FlexIO2 pins/LCD controller pins, as well as the PSRAM pins.

I wonder if using EXTMEM with full sized buffers will have a positive impact on the approach you’ve come up with above

vindar commented 1 year ago

Ill give the test app a try, and also try implementing it into my current app and see how it performs.

Here is a slightly improved version that only redraws the changed area and not the whole screen each time. This changes nothing for the stripes demo attached because the whole screen changes at each frame but it should save a little time in 'real use cases' where only part of the screen (e.g. widgets) get updated...

lvgl_tearfree2.zip

Im making a hybrid T4.1/Micromod test board which will add all of the FlexIO2 pins/LCD controller pins, as well as the PSRAM pins.

Nice :)

I wonder if using EXTMEM with full sized buffers will have a positive impact on the approach you’ve come up with above

I do not know. I think that EXTMEM is connected via QSPI at (about) 80Mhz so this may become the bottleneck... Can DMA/flexio output directly from EXTMEM ? Clearly, this it is worth a try because if it works, it will provide more than enough memory to perform real double buffering...

david-res commented 1 year ago

Hi Arvind, I downloaded your modified branch and the test program and indeed it works very well! Well done! This is really fantastic!

I implemented the code changes into my current app, and while on scrolling of the menus there is no more visible tearing, it has slowed down the animations enough for me to notice it. The is obviously due to the SW rotation that happens between the LVGL FB and the tgx FB. While in most cases this might suffice, I do wonder if using the PXP to do the exact same sized rotation offloaded from the CPU will improve the result

vindar commented 1 year ago

Hi David,

It is surprising that the copy/rotation slows down animations because I think it should not take much time. I did not time it but I would expect the rotation to take at most 1 or 2ms per frame so it should not really be noticeable.

I do not know if using the PXP will help here because I think performing the rotation with PXP will take just as much time and even if this frees up the CPU, there is nothing that LVGL can do during that time frame because both buffers (the small LGVL and the full internal one) are in use... Maybe we could try giving LVGL a second partial buffer that can be use during PXP operation but setting all this will be a bit tedious.

There is another possible improvement that may be easier to implement (but I do not know if it will help much). Right now, we always wait until the TE interrupt to start the screen DMA upload but in many cases, we could start earlier (because the scanline is already far down enough so that it will not be catched up by upload or because we upload only just a small region...). Knowing the period between two TE interrupts, we can compute the current position of the scanline at any moment so we can decide if upload may start right away or if we must wait for the TE interrupt (or even better, we can compute the first time when upload can start and then set a one-shot timer interrupt to trigger the DMA upload)... I do not have much time right now but I may try to test this later (probably in a few days).

david-res commented 1 year ago

Hi Arvind!

I think the lack of the 2nd buffer is the main cause of this issue, and agree that the PXP will take just as much time to rotate if not more, but it still frees up the CPU which could be crucial for the rest of the application to run.

No rush on testing further, at your own pace :) I appreciate the time invested in this thus far!

cstrahan commented 1 year ago

It is really a shame the Teensy does not have an (easily accessible) hardware debugger available. Here, it would prove really useful !

FWIW, if you have a steady hand and some patience, you can add a header to breakout the JTAG signals:

teensy_mod

I can successfully flash and debug my Teensy 4.1 and Teensy MicroMod this way :)

I followed the info over at https://forum.pjrc.com/threads/69748-Simple-WORKING-JTAG-UART-for-Teensy-4-1

My approach was to solder some long 32 AWG copper legs onto a 2x5 pin 0.05" pitch header (in the standard 10 pin SWD/JTAG pinout), super glue that to the top of the NXP (I dunno where my epoxy went, otherwise I'd have smothered the thing in epoxy by now), and then trim and solder the legs to either traces or (when available) vias that I had exposed by scraping off the solder mask. I can then connect that directly to my CMSIS-DAP debugger and debug over JTAG.

Kinda off topic, I know, but I figured I'd share, in case that provides any inspiration (I'm lurking because I'm thinking about using this library, or more accurately, porting it to Rust). :smile:

david-res commented 1 year ago

@cstrahan that looks pretty cool! I actually have a custom PCB here with more pins exposed, and all the MM parts, but I don't know why I didnt expose the JTAG pins! Would love to see this driver come in use for someone :)

david-res commented 1 year ago

@vindar I was able to get the PXP rotation working eventually, but the only major issue/ performance impact is setting LVGL to do full screen fresh reach time. It can rotate a 480320 image in a very shot time, around 15ms including FLEXIO/DMA transfers. I can cut down another 1ms by using two small rotation buffers instead of one (32096*2).