Bug with IDF SPI driver ? Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658)

jonshouse1 commented 3 years ago

This looks like a bug with the esp-idf SPI driver.

https://github.com/espressif/esp-iot-solution/issues/110

lexxvir commented 3 years ago

Hi there,

It seems I have encountered the same bug. We use the IDF v4.2 branch. We have a custom board with an ST25R3911 connected to the ESP32 by SPI. For internet connectivity we use WiFi or Ethernet. When we use WiFi everything works fine. When we use Ethernet it works fine most of the time, but it is possible to "break" the ST25R3911 by sending many ICMP requests to the ESP32 (ping -f for a couple of seconds is enough). The same ping -f command does not "break" the ST25R3911 when we use WiFi.

Meanwhile, in another project we use the same hardware with IDF v3.3. In that case there is no such issue, so I think it is an issue in the v4.x branch.

By "break" I mean that the ST25R3911 falls into an "odd" state in which it does not process commands.

corecode commented 3 years ago

Can confirm SPI transfer issues when ethernet is receiving packets. For us, it seems that the SPI data transmit pointer is reversed by 32 bytes, then the read pointer jumps back to its original value and the rest of the data is transferred correctly. Sometimes this reversal happens after just a few transmitted bytes (again seems 2^n length, e.g. 64 bytes); sometimes this offset exists from the beginning, for our whole SPI transfer (almost 4000 bytes). Of course I don't know whether it is the SPI data transmit pointer; just the data appears "shifted", i.e. sections repeat (for a while).

Higher ethernet packet rates and larger ethernet packets make the bug appear more frequently. With the right nping settings I can trigger this bug several times per second.
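
One way to pin down the "shifted" or repeating data described above is to send a buffer in which every position is identifiable, then compare a capture (logic analyzer or loopback) against it. The following is a minimal host-side sketch in plain C, not ESP-IDF code; `fill_pattern` and `find_first_mismatch` are hypothetical names, and the 16-bit counter pattern is an assumption that only works for transfers shorter than 128 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf with a big-endian 16-bit counter so that every 2-byte word
 * encodes its own position in the transfer. */
void fill_pattern(uint8_t *buf, size_t len)
{
    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2) {
        uint16_t word = (uint16_t)(i / 2);
        buf[i]     = (uint8_t)(word >> 8);
        buf[i + 1] = (uint8_t)(word & 0xFF);
    }
}

/* Compare a captured transfer against the expected pattern.
 * Returns the byte offset of the first wrong word, or -1 if all match.
 * On a mismatch, *shift_bytes says how many bytes earlier (positive) or
 * later (negative) in the source buffer the captured word came from. */
long find_first_mismatch(const uint8_t *captured, size_t len, long *shift_bytes)
{
    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2) {
        uint16_t expected = (uint16_t)(i / 2);
        uint16_t got = (uint16_t)((captured[i] << 8) | captured[i + 1]);
        if (got != expected) {
            *shift_bytes = ((long)expected - (long)got) * 2;
            return (long)i;
        }
    }
    return -1;
}
```

A reported shift of 32 at some offset would match the "reversed by 32 bytes" observation; a shift that is not a power of two would point somewhere else.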

elcojacobs commented 3 years ago

We have been chasing this bug for 2 weeks and I think I finally identified the relationship between Ethernet and display glitches. Our display glitches could be explained by missed set x/y commands.

The ILI9488 samples the D/C pin on the first falling clock after CS goes low. This requires SPI mode 3 to have an idle high CLK. But that's not all.

We use DMA for all display writes and set the DC pin in the pre-callback. When Ethernet is not connected, I think most display writes finish immediately and the dma queue stays empty. As a result, the CS pin is toggled by the DMA SPI handler for each transfer. When Ethernet is connected, the ethernet task loads the CPU or dma handler with other tasks, which allows the display dma queue to fill with more than one transfer. When these dma transfers are handled back to back, the CS pin stays low. The second transfer in the queue does not have a CS pin high to low transition and the DC pin is not resampled.

We fixed it by adding a write high and write low to the CS pin immediately after setting the DC pin, in the pre-callback.
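
The ordering of that workaround can be sketched as follows. This is a host-side model, not driver code: `pin_write`, `PIN_DC`, `PIN_CS` and `display_pre_transfer` are hypothetical stand-ins for the GPIO write and the pre-transfer callback used on the target, and it assumes CS can be driven as a plain GPIO from inside the callback.

```c
#include <assert.h>
#include <stddef.h>

enum { PIN_DC = 0, PIN_CS = 1 };          /* hypothetical pin ids */

typedef struct { int pin; int level; } pin_event_t;

#define MAX_EVENTS 16
pin_event_t g_events[MAX_EVENTS];         /* trace of pin writes, for inspection */
size_t g_event_count;

/* Stand-in for a GPIO write; records the write instead of touching hardware. */
static void pin_write(int pin, int level)
{
    assert(g_event_count < MAX_EVENTS);
    g_events[g_event_count++] = (pin_event_t){ pin, level };
}

/* Model of the pre-transfer callback: set D/C for the upcoming transfer,
 * then pulse CS high and low again so the display sees a fresh falling
 * edge even when the previous transfer left CS low. */
void display_pre_transfer(int is_data)
{
    pin_write(PIN_DC, is_data ? 1 : 0);   /* low = command, high = data */
    pin_write(PIN_CS, 1);
    pin_write(PIN_CS, 0);
}
```

The point of the order is that D/C is already stable before CS falls, so whatever the display latches on that edge is the level meant for this transfer.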

What led you to conclude that the pointer is wrong?

corecode commented 3 years ago

i don't know if there is a pointer that is wrong. i can see data repeating (skipping back) during a transfer, but sometimes only part of the data in the middle of the transaction.

elcojacobs commented 3 years ago

Repeated parts of the screen could perhaps come from a missed set position command. Instead of overwriting the same part, you would be writing to a different position.

I can't know whether your problem is the same as ours, but perhaps try toggling CS high and low after setting DC and before starting the SPI transfer.

I thought it was a memory bug in the driver too, but this has fixed it for us. It still could be a timing issue with a bug in the driver, so keep us posted on what you find.

corecode commented 3 years ago

Seems 90c4827bd22aa61894a5b22b3b39247a7e44d6cf introduced this bug (bisect result).

corecode commented 3 years ago

See https://github.com/espressif/esp-idf/pull/7874 for a bugfix. I don't know if this will fix it for everybody, and whether the change from 32 to 16 bytes RX burst length is sufficient in all cases, but it seems to work for us.

corecode commented 3 years ago

Turns out that 16 fixed it for my test firmware, but I have to go to 8 bytes for our main application firmware.

elcojacobs commented 2 years ago

Nice find. Do you know why the burst length has an effect?

corecode commented 2 years ago

no clue. If I had to guess, maybe the SPI DMA doesn't get time on the bus, but doesn't have a way to deal with it and instead outputs old data in its buffer and at some point somehow snaps back to the correct read address.

jonshouse1 commented 2 years ago

" but I have to go to 8 bytes for our main application firmware."

I played with these values months back, but dismissed the approach as the length required to make SPI reliable is so short that I suspect it defeats the entire point of using DMA for the transaction?

corecode commented 2 years ago

it's the dma burst setting for Ethernet. idf 3.3 used 4 bytes.

jonshouse1 commented 2 years ago

"it's the dma burst setting for Ethernet. idf 3.3 used 4 bytes."

Ahh ok, interesting, that would make me simply wrong, sorry. I neglected to compare the SPI drivers to previous versions. I mostly focused on the driver handling the display in my project rather than the driver handling Ethernet.

I tried hacking the settings in the "esp-iot-solution/components/bus/spi_bus.c" code, as this seems to be the SPI driver that the SPI LCD on my VNC client project is using. I also tried tweaking the IDF SPI code but failed to make any real improvements. I vaguely remember that disabling DMA for the SPI driver fixed the issue, but I was not shocked to find it was so slow as to be worthless.

jonshouse1 commented 2 years ago

Just to confirm I changed emac_hal.c as per #7380 and rebuilt my project.

I've been running the VNC client for a couple of hours and see no display corruption so far.

Thanks.

elcojacobs commented 2 years ago

Did you try the cs toggle after each DC pin change? It does sound like a timing issue to me. Whether the CS pin toggle works because of how the display samples the pin or because it caused a small delay, I'm not sure, but I have not seen any glitches since.

Maybe shortening the burst length causes the ethernet DMA to not hold the DMA long enough to let SPI DMA queue up.

Another thing worth trying is to explicitly give the ethernet DMA and SPI DMA different channels.

jonshouse1 commented 2 years ago

Did you try the cs toggle after each DC pin change?

Not sure who this is addressed to ?

Just for clarity, I tried changing many settings in the "esp-iot-solution/components/bus/spi_bus.c" driver, the SPI display driver my project seems to be using. No change fixed the issue. I tried changing dma_chan from the original "host_id"

Also tried changing these values below, my comments are tagged JA

None of this had any positive effect. Your theory does not seem to match the observation of "corecode" or my experimentation.

I since changed back to the default spi_bus.c and applied #7380 and I can confirm that seems to fix the display corruption issue for me, on the version of tools I am using.

$ idf.py --version
ESP-IDF v4.4-dev-1594-g1d7068e4b-dirty


spi_bus_device_handle_t spi_bus_device_create(spi_bus_device_handle_t bus_handle, const spi_device_config_t *device_conf)
{
    SPI_BUS_CHECK(NULL != bus_handle, "Pointer error", NULL);
    _spi_bus_t *spi_bus = (_spi_bus_t *)bus_handle;

    _spi_device_t *spi_dev = malloc(sizeof(_spi_device_t));
    spi_device_interface_config_t devcfg = {
        .command_bits = 0,
        .address_bits = 0,
        .dummy_bits = 0,
        .clock_speed_hz = device_conf->clock_speed_hz,
        .duty_cycle_pos = 128,      //50% duty cycle
        .mode = device_conf->mode,
        .spics_io_num = device_conf->cs_io_num,
        // Keep the CS low 3 cycles after transaction, to stop slave from missing the last bit when CS has less propagation delay than CLK
        .cs_ena_posttrans = 4,          // JA, was 3 now 5
        .cs_ena_pretrans = 4,           // JA added, pause after chip select before spi write
        //.flags = SPI_DEVICE_HALFDUPLEX,       // JA tried, kills touch driver
        .queue_size = 1                 // JA changed from 3 to 1

elcojacobs commented 2 years ago

You changed the dma queue to length 1, which means you cannot queue up more than 1 transfer and will get a CS toggle before each transfer. If that didn't help, my theory is incorrect.

Another thing to note is that setting the receive length to 0 for a transaction is NOT disabling rx, that will make it default to the transmit length. You really need to set the rx data pointer to nullptr. If the rx pointer is uninitialized, SPI reception can overwrite random memory.
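
The rule in the paragraph above can be written down as a small model. This is a simplified stand-in, not the real `spi_transaction_t` (which has more members and flags); `model_transaction_t` and `effective_rx_bits` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the transaction fields discussed above. */
typedef struct {
    size_t      length;     /* bits to transmit */
    size_t      rxlength;   /* bits to receive; 0 means "same as length" */
    const void *tx_buffer;
    void       *rx_buffer;  /* NULL means no receive phase */
} model_transaction_t;

/* Number of bits that would be written into rx_buffer. */
size_t effective_rx_bits(const model_transaction_t *t)
{
    assert(t != NULL);
    if (t->rx_buffer == NULL) {
        return 0;                       /* reception really is disabled */
    }
    return t->rxlength != 0 ? t->rxlength : t->length;
}
```

In practice this is an argument for zero-initialising every transaction struct before filling it in, so that the rx pointer is NULL unless it was set on purpose.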

I'm just trying to help find the real cause. I have solved my own problem already and have a working display at 16 MHz SPI with a DMA queue of length 10. Toggling the DC pin is done by the pre-callback in the DMA handler.

Reducing the tx burst length might work, but it is just a workaround.

jonshouse1 commented 2 years ago

"If the rx pointer is uninitialized, SPI reception can overwrite random memory."

Yes, I can see that would be the case. You talk as if this is my driver and I somehow have agency over it? I expect others above my paygrade to ship drivers that work! ... If you feel you can do better then please have a go at diagnosing and fixing the issue.

My changes were mostly by blind feel. Currently my test gear is packed away in storage and my skills are marginal, so clearly I am not the best person to fix the issue. I tried lots of permutations; the code I pasted was simply the state it was in when I gave up poking at it. As I said, nothing before #7380 made any difference. If you feel you can nail the issue down to a clear fault and solution then please do so.

elcojacobs commented 2 years ago

I was led here by the issue that @corecode created. I had a quick look at your use of the driver and I do see a potential problem:

https://github.com/jonshouse1/ESPVNCC/blob/49cb7454c9cb6a9386fb05162651ef44c0a3c8cc/main/lcd_vncc.c#L363

Here you give a stack-allocated local buffer (pixels) to the display driver. If this driver sends the data asynchronously using DMA, then this function will return and the buffer will be deallocated. This means the memory can be overwritten before it is transferred by DMA.

If the function jag_draw_bitmap is blocking until the transfer is completed, then giving it a stack-allocated buffer is fine. I don't have the time to dive deeper to figure this out. If the jag_draw_bitmap function queues a transfer and doesn't wait for completion, then anything the processor does between the return of this function and the actual transfer can overwrite your pixels.
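
A rough host-side model of the safe pattern: the data is copied into a buffer with static storage, and that buffer is not reused until the queued transfer has completed. All names here (`draw_chunk_async`, `finish_transfer`, `s_pixels`) are hypothetical, and `finish_transfer` merely stands in for the DMA engine reading the queued pointer at some later time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PIXELS_PER_CHUNK 64

/* Static storage: still valid after the drawing function returns. */
static uint16_t s_pixels[PIXELS_PER_CHUNK];
static bool     s_in_flight;            /* true while the "DMA" owns s_pixels */

/* Pending-transfer record, standing in for a driver queue entry. */
static struct { const uint16_t *src; size_t count; } s_pending;

/* Queue a chunk for asynchronous sending. Returns false if the previous
 * transfer has not completed, in which case s_pixels is left untouched. */
bool draw_chunk_async(const uint16_t *pixels, size_t count)
{
    assert(count <= PIXELS_PER_CHUNK);
    if (s_in_flight) {
        return false;
    }
    memcpy(s_pixels, pixels, count * sizeof pixels[0]);
    s_pending.src = s_pixels;
    s_pending.count = count;
    s_in_flight = true;
    return true;                        /* returns before the data is sent */
}

/* Stand-in for the transfer completing later: reads the queued pointer. */
size_t finish_transfer(uint16_t *wire, size_t wire_capacity)
{
    assert(s_in_flight && s_pending.count <= wire_capacity);
    memcpy(wire, s_pending.src, s_pending.count * sizeof wire[0]);
    s_in_flight = false;
    return s_pending.count;
}
```

With this shape the caller's own buffer can live on the stack and be overwritten as soon as `draw_chunk_async` returns, because the queued pointer never refers to it.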

jonshouse1 commented 2 years ago

I was led here by the issue that @corecode created.

? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.

Here you give a stack-allocated local buffer (pixels) to the display driver.

I did wonder about that; if you look back through the commits you will see several attempts at different semaphores. Frankly I lack the skill to unpick issues if the drivers simply do not work for the one workload I am doing! If you think the issue with display corruption is just my code then you are wrong (not saying my code might not have all kinds of issues, just that the CORE issue with it is not my code!) The display corruption seems to be an interaction between the drivers for physical Ethernet and DMA driven SPI. I've written several bits of test code that prove this to be the case, i.e. Ethernet alone works fine, SPI display alone works fine, and only the two together cause issues; even if the code does not process the network data, the interaction still exists. Rebuilding ESPVNCC to use WiFi fixes it 100%. Adding a delay between Ethernet RX and SPI write improves it, other people see the exact same issue, and so on... I am 100% convinced that this is not just my issue, but an actual issue.

I can't really do much better in my code until I get a working IDF and drivers, then I will have time (and clarity) to add the missing locks. Put simply I can not debug hardware, my code AND the IDF drivers at the same time.

Please keep the conversation here to the core issue ("Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658)"). If you want to open an issue against ESPVNCC or email me directly then I welcome your comments, fixes or improvements. This was my first project of its type and a new implementation of a VNC client from scratch, so I do not expect it to be bug free, nor do I regard myself as the last word in code quality :-) If you wish to take a stab at fixing the issue here then please feel free, but only hardware using Physical Ethernet will reproduce the issue. Thanks.

jonshouse1 commented 2 years ago

PS I am also seeing a periodic, infrequent crash with my code; this may actually be the issue with my code that "elcojacobs" just described :-) Will others using #7380 please check the long term stability of their projects? As I said, probably just my issue at this point.


rect 00004 xpos=00003 ypos=00068 width=00237 height=00013 et=0 took 10ms
rect 00005 xpos=00003 ypos=00094 width=00237 height=00104 et=0 took 80ms
Got VNC_SMT_FRAMEBUFFERUPDATE 1 rectangles
assertion "handle == get_acquiring_dev(host)" failed: file "IDF/components/driver/spi_master.c", line 949, function: spi_device_polling_end

abort() was called at PC 0x40100b1b on core 0

Backtrace:0x400d3b37:0x3ffce590 0x4008794d:0x3ffce5b0 0x4008d60a:0x3ffce5d0 0x40100b1b:0x3ffce640 0x40085559:0x3ffce670 0x400855cd:0x3ffce6a0 0x400db033:0x3ffce6c0 0x400db185:0x3ffce720 0x400db2c4:0x3ffce750 0x400dbf44:0x3ffce790 0x400dc047:0x3ffce7d0 0x400d9310:0x3ffce800 0x400d9ca6:0x3ffce820 0x400d9dfe:0x3ffcf060 0x400da2f1:0x3ffcf0a0 0x4008a769:0x3ffcf1d0

elcojacobs commented 2 years ago

? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.

I was referring to https://github.com/espressif/esp-idf/pull/7874

If you think the issue with display corruption is just my code then you are wrong (not saying my code might not have all kinds of issues, just that the CORE issue with it is not my code!) The display corruption seems to be an interaction between the drivers for physical Ethernet and DMA driven SPI.

If you are giving the SPI DMA driver a pointer to memory that is deallocated, that is a bug in your code. If the ethernet driver is the only piece of code overwriting that same area of memory, it is perfectly allowed to do so, because you released the memory. You could make the pixel buffer static to not release it and re-use it every time you call that function.

I am not certain there is no bug in the ESP-IDF driver. My point is just that the fact that using ethernet and DMA at the same time causes corruption does not mean necessarily that the ESP-IDF drivers are buggy.

If you wish to take a stab at fixing the issue here then please feel free, but only hardware using Physical Ethernet will reproduce the issue.

I am not using your code in any way, so I won't open an issue against it or fix it. If there is an actual bug, I want to know. That's why I am replying with the issues that I found to help others fix bugs in their code or pinpoint the actual issue in the drivers.

I have a board with an SPI display (ILI9488) and hardware ethernet (LAN8724) and open-source firmware: https://github.com/BrewBlox/brewblox-firmware I was convinced there was a bug in ESP-IDF that caused interaction between SPI and ethernet until I finally found the issue in our code that caused the interaction.

Writing DMA code is hard; you'll have to take many things into account, like:

- Only allocating DMA-capable memory and only freeing it when the DMA transfer is done
- Thread-safe access to the DMA buffer
- Toggling CS/DC pin at the right time for the display

corecode commented 2 years ago

Elco, you are derailing this issue.

The problem is that if you use the normal SPI driver (which uses DMA internally) and you receive Ethernet frames at the same time, then occasionally the SPI transfers incorrect data.

Changing the (internal) Ethernet DMA burst size back to what it was in IDF 3.3 makes these incorrect transfers disappear. ESP bug. Maybe hardware.

elcojacobs commented 2 years ago

There are 2 functions: spi_device_queue_trans and spi_device_transmit. spi_device_transmit just calls spi_device_queue_trans and waits for the (DMA) transfer to complete. If SPI transfers have errors when using the blocking spi_device_transmit, then I agree that it is an internal framework issue.

corecode commented 2 years ago

I use blocking spi transmit and have problems.

jonshouse1 commented 2 years ago

If you are giving the SPI DMA driver a pointer to memory that is deallocated, that is a bug in your code. If the ethernet driver is the only piece of code overwriting that same area of memory, it is perfectly allowed to do so, because you released the memory. You could make the pixel buffer static to not release it and re-use it every time you call that function.

I am not certain there is no bug in the ESP-IDF driver. My point is just that the fact that using ethernet and DMA at the same time causes corruption does not mean necessarily that the ESP-IDF drivers are buggy.

I agree, thank you for pointing out the error, my code is plain old wrong... I can't be any clearer.

BUT I also can not be any clearer. I've done some tests (not just hacking the same project over and over) and the esp IDF code is buggy !

For example, if I statically allocate a buffer (in the global namespace), fill it with a single value, then write that to the display over and over using the drivers provided, I get a solid colour on my screen, as expected. If I now use a Linux machine to send a moderate amount of UDP data to the ESP, the display corrupts... You do not even need to read the data from the socket; simply the act of writing SPI while Ethernet is active is enough.

Only allocating DMA-capable memory and only freeing it when the DMA transfer is done

Done

Toggling CS/DC pin at the right time for the display

That is what the fucking drivers are for !

I can not make it any more plain !

I am using IDF and "iot solution" drivers. I expect them to handle locking, shared access to SPI bus and chip selects, because ...... That is what the fucking drivers are for !

Thanks, Jon

jonshouse1 commented 2 years ago

if there is an actual bug, I want to know. That's why I am replying with the issues that I found to help others fix bugs in their code or pinpoint the actual issue in the drivers.

From my point of view it is very difficult to debug my own code if the drivers provided for me by the toolkit do not work properly, and worse still ONLY do not work properly for the ONE permutation of hardware I've chosen to use.

I will try and fix my code. I've moved the pixel buffer to a global, double buffered it so that the display and network code use alternate halves of the buffer and re-enabled a semaphore in the display code.... or will someone now claim I should write my own display driver and bless each pixel in turn?


SemaphoreHandle_t               xs              = NULL;         // Mutex guarding access to the display driver

void jag_init(scr_driver_t* driver)
{
        jag_lcd_drv     = *driver;
        scr_info_t      lcd_info;

        jag_lcd_drv.get_info(&lcd_info);
        jag_width       = lcd_info.width;
        jag_height      = lcd_info.height;
        if (xs == NULL)
                xs = xSemaphoreCreateMutex();
        ESP_LOGI(TAG,"jag_init() - Screen name:%s | width:%d | height:%d", lcd_info.name, lcd_info.width, lcd_info.height);
}

void jag_draw_bitmap(uint16_t x, uint16_t y, uint16_t w, uint16_t h, uint16_t *bitmap)
{
        esp_err_t       ret;

        // Wait up to one second for exclusive use of the display
        if (xSemaphoreTake(xs, pdMS_TO_TICKS(1000)) == pdTRUE)
        {
                ret=jag_lcd_drv.draw_bitmap(x, y, w, h, bitmap);                        // Call ili9341 driver, limited to 4000ish bytes
                if (ret!=ESP_OK)                                                        // set_window failed and no data was written
                {
                        ESP_LOGE(TAG,"draw_bitmap returned %d",ret);
                }
                xSemaphoreGive(xs);
        }
        else ESP_LOGE(TAG,"jag_draw_bitmap() Failed to acquire semaphore");
}

jonshouse1 commented 2 years ago

For "elcojacobs": I would expect draw_bitmap() to co-operate with the SPI driver to finish any DMA driven transaction it has in progress before it sets up the next one, yet I find (even with better buffer management) that I get SPI bus locking errors? Am I the sole user of the esp-iot-solution display and touch screen code? Googling for answers, it feels that way.

First chapter of the documentation for esp-iot-solution, Page 9, point 2 "Thread safe device operation", yet according to you I need to manage each and every thing myself by hand at a low level?

https://docs.espressif.com/_/downloads/espressif-esp-iot-solution/en/latest/pdf/

elcojacobs commented 2 years ago

I'll explain the DC pin issue once more.

The ESP-IDF driver is responsible for SPI handling and ensuring the SPI device is selected (the SPI mutex is owned and the CS pin is low). The DC (data/command) pin is a specific display thing and not the responsibility of the ESP-IDF driver. It might be the responsibility of the draw_bitmap function/jag library, but not the core ESP-IDF code.

The ILI9488/ILI9341 samples the DC pin on the first falling SPI clock after the CS pin goes low.

The ESP-IDF driver ensures that you own the SPI bus before the transaction starts. It does not guarantee that it releases the SPI bus in between if you do 2 transactions in a row.

So if you do multiple transfers back-to-back, like alternating a setpos command for the display and writing actual pixel data, you do the following:

1. set DC pin for command
2. transfer command + coordinates
3. set DC pin for data
4. transfer pixel data
5. set DC pin for command
6. send command + new coordinates

If the code reaches step 6 while step 4 has not finished, it will wait for step 4 to finish, either in the DMA handler or in the blocking SPI transfer function. Then it notices that the previous transfer was for the same client, the mutex is still owned and the CS pin is still low, so it can immediately start the next transfer.

This creates a problem: you have set the DC pin for the display correctly, but the CS pin remained low. The display didn't re-evaluate the DC pin because of this and the transfer of step 6 is interpreted as data instead of a command.

The ESP-IDF SPI driver is handling SPI as it should. The SPI device was properly selected for all transfers. It didn't know it had to release the CS pin in-between transfers to the same device to trigger the re-sampling of the DC pin.

The case where step 4 is not finished when step 6 is reached might only occur when step 4 is interrupted by Ethernet DMA and therefore takes a bit longer.
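
The failure mode described in this comment can be reproduced with a toy model. This sketch encodes the behaviour exactly as stated above (D/C latched once after CS falls and held until CS falls again); that behaviour is an assumption taken from the comment, not from the display datasheet, and `toy_display_t`, `toy_set_cs` and `toy_clock_byte` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the display behaviour described above: D/C is latched once,
 * on the first clocked byte after CS falls, and kept until CS falls again. */
typedef struct {
    bool cs_low;
    bool sample_pending;   /* set by a CS falling edge */
    bool latched_is_data;
    int  commands_seen;
    int  data_bytes_seen;
} toy_display_t;

/* Drive CS; pass true to pull it low (device selected). */
void toy_set_cs(toy_display_t *d, bool low)
{
    if (low && !d->cs_low) {
        d->sample_pending = true;      /* falling edge: re-sample D/C next */
    }
    d->cs_low = low;
}

/* Clock one byte into the display with the D/C pin at the given level. */
void toy_clock_byte(toy_display_t *d, bool dc_is_data)
{
    assert(d->cs_low);                 /* bytes only count while selected */
    if (d->sample_pending) {
        d->latched_is_data = dc_is_data;
        d->sample_pending = false;
    }
    if (d->latched_is_data) {
        d->data_bytes_seen++;
    } else {
        d->commands_seen++;
    }
}
```

In this model, a command byte clocked right after a data transfer without releasing CS is counted as data, while the same byte sent after a CS high/low pulse is counted as a command.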

jonshouse1 commented 2 years ago

The ESP-IDF driver is responsible for SPI handling and ensuring the SPI device is selected (the SPI mutex is owned and the CS pin is low). The DC (data/command) pin is a specific display thing and not the responsibility of the ESP-IDF driver. It might be the responsibility of the draw_bitmap function/jag library, but not the core ESP-IDF code.

3rd line of the README.md for ESPVNCC "For ESP IDF with esp-iot-solution display drivers."

https://github.com/espressif/esp-iot-solution

Not MY drivers at all. Espressif's.

Driver handles CS AND DC, see line 158 for example. https://github.com/espressif/esp-iot-solution/blob/master/components/display/screen/interface_driver/scr_interface_driver.c

Interestingly I do understand what you say and have written LCD drivers for PICs, but the LCD driver code IS NOT MY CODE. So please stop saying "You need to" when addressing me!

As I said "That is what the fucking device driver is for !" .. not MY device driver, THE device driver.

My expectation is that it just works - at the same time as Ethernet, the rest is for others "above my paygrade"

jonshouse1 commented 2 years ago

PS If you now claim the project ESP IDF need not work properly with project IOT-SOLUTION then I will have a very short two word answer for you.

jonshouse1 commented 2 years ago

set DC pin for command transfer command + coordinates set DC pin for data transfer pixel data set DC pin for command send command + new coordina

https://github.com/espressif/esp-iot-solution/blob/master/components/display/screen/interface_driver/scr_interface_driver.c Lines 207 onwards. The code clearly sets up an SPI transaction after setting the GPIO state for CS and DC and then performs either a data or command related SPI function. I would assume some kind of lock prevents it from performing another SPI transaction until it has finished all data/command sequences related to a single display transaction, otherwise the results would be an obvious mess?
Or are you claiming that the driver is not written that way, in which case do I need to file yet more bug reports?

Starting to wish I had used the wretched Arduino IDE instead now ......

elcojacobs commented 2 years ago

This can indeed be considered a bug in esp-iot-solution probably and I fully understand that you expect it to work, but this issue is posted in esp-idf and not in esp-iot-solution.

I am just pointing out what might cause it, I am not trying to blame you.

If my input is not appreciated, this will be my last reply. Good luck with your project.

jonshouse1 commented 2 years ago

This can indeed be considered a bug in esp-iot-solution probably and I fully understand that you expect it to work, but this issue is posted in esp-idf and not in esp-iot-solution.

1) As I said, others are also seeing this bug, they are using generic SPI devices, not iot solution code.
2) I asked you to stick to the topic of the bug report only, and asked you to pick another channel to communicate about my project if that was what you wanted to do - you flat out refused.
3) I told you REPEATEDLY that I am a user, yet you insist on telling me "You need to X" and then rant at great length about low level stuff, mostly unrelated to the actual bug report best I can tell.
4) I doubt you have the ACTUAL hardware to duplicate this issue and seem unwilling to help resolve it, but instead are using it as a forum to have a pissing competition to show your supreme knowledge. If I wanted that kind of shit I would join the Linux Kernel mailing list.
5) You did not read most of what I posted.
6) Your advice on what I "must do" flatly contradicts the library documentation for the display library I told you I was using. My project readme is tiny, but I think you did not read it.
7) corecode actually told you "Elco, you are derailing this issue.", you ignored him as well.

I am just pointing out what might cause it, I am not trying to blame you.

That is fine. As I said my code may be faulty, but after reviewing the iot-solution documentation probably not in the way you claim

If my input is not appreciated, this will be my last reply.

It is not for me to say, personally I find you deeply annoying but you may be working with others here in some way more constructively, this is not my personal forum either.

Good luck with your project.

Ah that seems to be the issue. This is a user bug report, not a forum for my project. I acknowledged that my talent may be marginal, that my code may be broken in some ways BUT I clearly told you that the issue was real, others had also observed it and that THIS forum is the bug report - NOT for my project and NOT for you to show your supreme cleverness while ignoring the actual bug report.

Maybe I should have posted the bug with a cut down code example instead, that is my fault.

Thank you corecode for #7380, that change seems to fix display corruption for me. I have a periodic crash, but that may be related to the iot-solution display code or possibly my failure to correctly drive it.

elcojacobs commented 2 years ago

Jesus, I just added info that I think could help people at espressif solve the bug or pinpoint the problem.

I'm not trying to prove anything. I spent 2 weeks chasing a display bug and I am sharing what I found in an attempt to help improve the framework that we all use.

jonshouse1 commented 2 years ago

I'm not trying to prove anything. I spent 2 weeks chasing a display bug and I am sharing what I found in an attempt to help improve the framework that we all use.

In that case you remind me of talking to my wife, lots of words but maybe forgetting to include the basic context.

Is the display bug related to an IDF display driver or do you have your own display driver? If it relates to a driver in IDF then should it not be posted against that? If it is your own display driver then what did you learn about the interaction between Ethernet and DMA SPI? If you are NOT using Ethernet and SPI then you are wasting your time and ours, as this is ONLY about that combination and its interaction. You seem to be writing a tutorial on SPI displays rather than contributing to this bug report, best I can tell.

elcojacobs commented 2 years ago

I had display corruption, using lvgl on top of esp-idf SPI drivers, which occurred only when Ethernet was plugged in. With lan8724 phy and internal mac.

No display corruption without Ethernet plugged in. If Ethernet was actually used, I would get display glitches; unplug the cable, no glitches. The app could have 2 network connections and 2 IP addresses, so I can say that I had no glitches with WiFi with the same firmware binary.

But you are convinced that this could not have the same cause as the issue reported here, so please forget anything I said. 🙄

jonshouse1 commented 2 years ago

I had display corruption, using lvgl on top of esp-idf SPI drivers, which occurred only when Ethernet was plugged in. With lan8724 phy and internal mac.

No display corruption without Ethernet plugged in. If Ethernet was actually used, I would get display glitches; unplug the cable, no glitches. The app could have 2 network connections and 2 IP addresses, so I can say that I had no glitches with WiFi with the same firmware binary.

Fantastic, why not lead with that!

But you are convinced that this could not have the same cause as the issue reported here, so please forget anything I said.

Sounds the same as my issue. How does a long tutorial on writing display drivers, Data/CMD and chip selects on SPI displays help with this? You seem to be saying that the DMA transaction stops in the wrong place, but wrong relative to what? The information you provide seems to relate to your experience of a specific display driver, which (had you read anything I had written) you would know is not the display driver I am using. As this seems to affect any SPI+Ethernet combination, why does the fault not lie with the Ethernet driver as #7380 suggests?

Maybe you could summarise what you learnt in a way that avoids display driver specifics. Is #7380 not a viable fix for the issue? If not, then why? If it is a viable fix, then why all the extra words and fluff about my code and display drivers from you? Can you simplify and summarise the additional point you are making?

elcojacobs commented 2 years ago

I did try to explain why a timing issue could cause the CS to stay low, which could cause the display to miss the command/data selection. The issue is subtle: it requires understanding of how the display samples the pin and how seemingly unrelated DMA transfers can cause timing differences. So I needed a lot of words, and you misdirected your anger at espressif towards me because of it. Yes, I did point out other potential bugs that I found, in an effort to help in case we all share a bug with similar symptoms but other causes, and you think I am an asshole because of it. A timing difference could mask either memory bugs or the DC pin sampling issue. Changing the Ethernet burst length changes the time an Ethernet DMA transfer takes.

I was watching this issue because I thought there was a bug in the low level drivers. I found out there maybe wasn't and figured it would be nice to share my findings.... I was met with "no this is a bug in the driver, please shut up", so I tried to explain it in more detail. Only to be compared to your wife or elitist Linux devs.

corecode commented 2 years ago

My issue often presents as data in the middle of the SPI transaction getting corrupted. I don't think that can be explained with CS signals.

jonshouse1 commented 2 years ago

I did try to explain why a timing issue could cause the CS to stay low, which could cause the display to miss the command/data selection.

We have at least 3 different permutations of drivers and code that all fail to work in almost exactly the same way in this bug report. Are you claiming that 3 sets of people have 3 different subtly wrong programs that only show breakage when Ethernet is active? Because that would seem very, very unlikely. You seem to have grasped the premise that my code is faulty, hence the report is invalid and the others are also invalid. OK, possible, but unlikely, and as I say flatly contradicted by the tests I did (I guess you are forcing me to re-write or find them ...)

One of my tests was to write every line of the LCD over and over with a single colour from a global static buffer, and even that corrupts when Ethernet is active. Now you are going to claim that this code is subtly wrong? That my pointer to 240 x 16 bits of a solid colour is somehow not quite correct, does not meet some magic pixel painting constraint?
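
A minimal host-side sketch of that kind of test, for anyone wanting to reproduce it: one global static line of 240 16-bit pixels filled with a single colour, plus a checker that reports the first pixel that differs. This is not the original test code, which was not posted in this thread. The names `line_buf`, `fill_line` and `first_bad_pixel` are made up for illustration, and the `captured` argument is assumed to come from something like a logic analyser capture of what actually went out on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_PIXELS 240 /* one LCD line, as in the test described above */

/* Global static line buffer: 240 x 16-bit pixels (RGB565 assumed). */
static uint16_t line_buf[LINE_PIXELS];

/* Fill the whole line with one solid colour. */
void fill_line(uint16_t colour)
{
    for (size_t i = 0; i < LINE_PIXELS; i++) {
        line_buf[i] = colour;
    }
}

/* Return the index of the first pixel in a captured line that differs
 * from the expected colour, or -1 if the whole line is intact. */
long first_bad_pixel(const uint16_t *captured, size_t n, uint16_t colour)
{
    assert(captured != NULL);
    for (size_t i = 0; i < n; i++) {
        if (captured[i] != colour) {
            return (long)i;
        }
    }
    return -1;
}
```

With a solid colour any wrong pixel is real corruption, but a shifted or repeated block of the same colour would go unnoticed, so a per-pixel pattern would be needed to catch that.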

suda-morris commented 2 years ago

Interesting topic. I did a quick test by putting the LAN8720 Ethernet initialization code into this example, and both of them just work fine, even without the PR fix. FYI, I didn't use the iot-solution driver but the st7789 driver located in esp-idf. As for that PR, we will still take it into consideration, and may make the burst size configurable.

corecode commented 2 years ago

Hi @suda-morris, thanks for looking at this. You will have to send a high ethernet packet load to make it become more likely. Depending on the packet load, I see the transmission error once per minute or several times a second. Longer frames make the transmission error more likely.

You don't even have to send IP packets; it is sufficient to send any ethernet frame (that gets discarded immediately by lwip).

I used this command:

sudo nping --count 0 --quiet --rate 100000 --ether-type 0x0700 --dest-mac cc:50:e3:be:f7:5b --data-length 1400 169.254.92.111
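
As a rough sanity check on what that command asks for, the sketch below estimates how many frames per second fit on the wire. It assumes a 100 Mbit/s link (the thread does not state the link speed) and treats `--data-length 1400` as roughly the frame payload, ignoring any headers nping adds. The function names are hypothetical. The 38 bytes of per-frame overhead are the standard Ethernet preamble and SFD (8), header (14), FCS (4) and interframe gap (12).

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead in bytes: preamble+SFD 8, header 14, FCS 4, gap 12. */
#define ETH_OVERHEAD_BYTES 38u

/* Upper bound on frames per second for a given link speed and payload size. */
uint32_t max_frames_per_second(uint32_t link_bps, uint32_t payload_bytes)
{
    assert(link_bps > 0);
    uint32_t bits_per_frame = (payload_bytes + ETH_OVERHEAD_BYTES) * 8u;
    return link_bps / bits_per_frame;
}

/* Rate the target can actually see: the requested rate, capped by the wire. */
uint32_t effective_rate(uint32_t requested_pps, uint32_t link_bps,
                        uint32_t payload_bytes)
{
    uint32_t cap = max_frames_per_second(link_bps, payload_bytes);
    return requested_pps < cap ? requested_pps : cap;
}
```

Under those assumptions the wire carries at most about 8692 such frames per second, so `--rate 100000` effectively means "as fast as the link allows".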

philippe44 commented 2 years ago

I have a similar issue on my project. I recently added ethernet so there are now lots of concurrent accesses on SPI bus (3 devices: W5500, display and gpio expander). W5500 on its own works, spi driver as well (has been working for years now) but when they co-exist, there is a crash after a few seconds of high load. All transactions are polling.

The crash happens during a display transaction because, although the display transaction has started, get_acquiring_dev() claims there is nobody that has acquired the device (NULL returned). Changing dmabmr.rx_dma_pbl = EMAC_DMA_BURST_LENGTH_8BEAT; does not do anything.

corecode commented 2 years ago

This issue is only related to the embedded emac, not external wiznet via SPI.

philippe44 commented 2 years ago

This issue is only related to the embedded emac, not external wiznet via SPI.

Yep I realized. I thought the connection with get_acquiring_dev() would suffice to give me some pointer but no. I've opened another issue https://github.com/espressif/esp-idf/issues/8179
