SPIRAM accesses seem to interfere with UART at high speed (IDFGH-4309)

davidthings commented 3 years ago

Environment

Development Kit: [ESP32-Wrover-Kit]
Kit version (for WroverKit/PicoKit/DevKitC): [v4]
Module or chip used: [ESP32-WROVER-B]
IDF version (run git describe --tags to find it): v4.2-dev-3672-g178b122c1
Build System: [idf.py]
Compiler version (run xtensa-esp32-elf-gcc --version to find it):xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
Operating System: [Linux]
Using an IDE?: [No]
Power Supply: [USB]

Problem Description

The application is a bridging project - connecting a TCP-IP socket to UART. The speeds need to be really high, so the project is pretty highly tuned. The test environment is host code (Linux) to open a TCP connection to the ESP board, then UART looped back in hardware. Messages are sent and received, testing the content for errors at each step.

During development I began to notice huge error rates in the messages coming back from the UART. What follows is the result of a multi-day attempt to track the bug down.

When using a UART (baud > 3,000,000) in a multitasking environment, where SPIRAM is also being used in other processes, there are frequent character errors.

with no other high priority code, things are fine
use no SPIRAM, things are fine
go 3Mbps or slower, things are fine

It is interesting to note that the problem is not caused by the use of SPIRAM in the UART code, the cause seems to be the interruption of code using SPIRAM. In the code below it seems to be the dummy task manipulating SPIRAM that causes the bad UART characters.

While it might be possible to work around this (by using lower baud rates, for example) it seems that an error like this is particularly nasty and might manifest in other places, causing even harder to find bugs and then other luckless engineers to spiral into days of debugging.

Expected Behavior

I expect mostly error free UART TX / RX. It should be noted that except for the conditions noted here, the UART works flawlessly even at 5,000,000 bps, and even with a rather confrontational multitasking environment. (This is really great, BTW!)

Actual Behavior

If the above conditions hold, (>3Mbps, SPIRAM being used) there are frequent errors.

Steps to reproduce

Grab a bare ESP-WROVER-KIT board
Connect pins 5 & 18
Grab a copy of the uart_async_rxtxtasks example
replace the code with the following gist - https://gist.github.com/davidthings/e06081ce465a0dfbd0df52074b88806f
idf.py menuconfig, and enable SPIRAM, setting the threshold to 512 (to force SPIRAM use for larger allocations)
select a test (see below) by setting the Test #define to some value
build, flash, run

A good run just ticks up good messages

Testing UART RX TX with SPIRAM Conflict Tests

    Test 0
        TX RX Baud Rate 4000000
        TX RX Task (or ~MX Task) 1
        TX RX Task Priority 10
        TX RX Memory on Heap 0
        UART TX buffer size 0
        Using Null Task 0
        Do Task Monitoring 0

Messages 0

Messages 400

Messages 800

Messages 1200

Messages 1600

Messages 2000

Messages 2400

A bad test

Testing UART RX TX with SPIRAM Conflict Tests

    Test 5
        TX RX Baud Rate 4000000
        TX RX Task (or ~MX Task) 1
        TX RX Task Priority 8
        TX RX Memory on Heap 0
        UART TX buffer size 4094
        Using Null Task 1
        Null Task Memory Size 100000
        Null Task Memory On Heap 1
        Null Task Priority 7
        Do Task Monitoring 1

Messages 0
        126: 254 != 126  (Total Errors 1)
        248: 146 != 120  (Total Errors 2)
        0: 127 != 255  (Total Errors 3)
        126: 25 != 253  (Total Errors 4)
        0: 126 != 254  (Total Errors 5)
        124: 23 != 250  (Total Errors 6)
        0: 125 != 253  (Total Errors 7)
        122: 18 != 247  (Total Errors 8)
        0: 124 != 252  (Total Errors 9)
        125: 22 != 249  (Total Errors 10)
        0: 123 != 251  (Total Errors 11)
        124: 21 != 247  (Total Errors 12)
        0: 122 != 250  (Total Errors 13)
        121: 18 != 243  (Total Errors 14)
        0: 120 != 248  (Total Errors 15)
        121: 16 != 241  (Total Errors 16)
        0: 118 != 246  (Total Errors 17)
        125: 18 != 243  (Total Errors 18)
        0: 116 != 244  (Total Errors 19)
        123: 14 != 239  (Total Errors 20)
        0: 115 != 243  (Total Errors 21)
        122: 12 != 237  (Total Errors 22)
        0: 114 != 242  (Total Errors 23)
        121: 10 != 235  (Total Errors 24)
        0: 113 != 241  (Total Errors 25)
        120: 6 != 233  (Total Errors 26)
        0: 112 != 240  (Total Errors 27)
        127: 11 != 239  (Total Errors 28)
        0: 111 != 239  (Total Errors 29)

Note that the errors are very repetitive and mostly amount to a single byte here and there.

Code to reproduce this issue

The gist contains code to reproduce the issue. There's a #define Test that lets you try different setups.

Test = 0 - UART TX and RX, with no TX Buffer, and with no interfering (NULL) task
Test = 1 - as above but adding in the NULL task doing busy work with static memory at lower priority as UART task
Test = 2 - as above but adding adding TX Buffer
Test = 3 - as above but adding another task to gather and print out task info
Test = 4 - as above but NULL task uses SPIRAM at higher priority
Test = 5 - as above but NULL task at lower rather than higher priority than the base task - lots of errors

There are other tests included for completeness. Perhaps they are helpful.

The key messaging code is reproduced below. In an infinite loop, we write a bunch of bytes in an increasing sequence to the UART, pause for a bit, then read them back. If the bytes are out of sequence, they are printed out. Pause and repeat.

The rest of the code sets up other dummy tasks.


    while (1) {

        for ( int i = 0; i < MESSAGE_LENGTH; i++ ) {
            txBuffer[ i ] = txSequence;
            txSequence = ( txSequence + 1 ) & 0xFF;
        }
        uart_write_bytes(UART_NUM_1, txBuffer, MESSAGE_LENGTH);
        sent++;

        vTaskDelay(10 / portTICK_PERIOD_MS);

        while (1) {
            const int rxBytes = uart_read_bytes(UART_NUM_1, rxBuffer, RX_BUF_SIZE, 0);
            if (rxBytes > 0) {
                for ( int i = 0; i < rxBytes; i++ ) {
                    if ( rxBuffer[i] != rxSequence ) {
                        rxErrors++;
                        printf("        %d: %d != %d  (Total Errors %d)\n", i, (int)rxBuffer[i], (int)rxSequence, rxErrors );
                    }
                    rxSequence = ( rxBuffer[i] + 1 ) & 0xFF;
                }
            } else
                break;
        }
        vTaskDelay(10 / portTICK_PERIOD_MS);

    }

https://gist.github.com/davidthings/e06081ce465a0dfbd0df52074b88806f

sdkconfig.txt

Alvin1Zhang commented 3 years ago

Thanks for the detailed report and letting us know, we will look into.

davidthings commented 3 years ago

further experimentation shows having a UART TX Buffer does increase the likelihood of data errors

davidthings commented 3 years ago

Working back with the bridge project (WiFi and UART) it looks like errors can occur even without SPIRAM. These may be a different issue.

            Receiving 
                @ 238 153 != 26
                @ 239 154 != 27
                @ 240 155 != 28
                @ 241 156 != 29
                @ 242 157 != 30
                @ 243 158 != 31
                @ 244 159 != 32
                @ 245 160 != 33
                @ 246 161 != 34

What's interesting about these errors is that the errors are not random. Because the test is to transfer a sequence of bytes, the errors above are telling. Instead of 9 bytes 153 - 151, we get 26 - 34. Was this strip of bytes just not transferred (leaving the old value in the buffer)? For 8 bytes were we looking at some other location?

(This was run at 240MHz, -O2, without SPIRAM)

negativekelvin commented 3 years ago

There are some uart fixes in master I don't know if they have been backported to other branches. Also you may have to adjust fifo full threshold and possibly patch in support for bigger fifo at that speed

davidthings commented 3 years ago

The testing was based on the latest Master (from a few days ago) so those should have included the UART fixes you mentioned.

So this seems more like an overrun than a context switching bug to you, @negativekelvin? The hope was to eliminate this possibility by some of the tests that avoid using SPIRAM - but nevertheless push everything pretty hard.

My hope is still that someone very close to this code gets a weird feeling about the situation and realizes that there's something obvious that's happening and can fix it!

davidthings commented 3 years ago

Reading https://github.com/espressif/esp-idf/issues/5101 I see the OP maintains that the FIFO still "stutters" periodically even with the fifo_cnt fix advocated in the workarounds document. The OP's description matches my experience to some degree.

@ginkgm the comiit has been published and i see that you now use rd/wr_addr to calculate queue length. this is part of the fix but not all of it. what i observed is when the "stutter" happens, fifo_cnt is updated while rd_addr and wr_addr are not. it's almost as if the read does not propagate to the FIFO circuit and returns old data that was stored in the bus register but gets far enough to update fifo_cnt (hence why i call it stuttering - it returns a byte we had already read before). this is why the most important part of my fix is where we read the byte from the fifo: after the read we check if rd_addr has been updated, meaning the byte that was returned actually comes from the FIFO and not a stale one.

For reference, here's the current ESP-IDF master implementation

/**
 * @brief  Get the length of readable data in UART rxfifo.
 *
 * @param  hw Beginning address of the peripheral registers.
 *
 * @return The readable data length in rxfifo.
 */
static inline uint32_t uart_ll_get_rxfifo_len(uart_dev_t *hw)
{
    uint32_t fifo_cnt = hw->status.rxfifo_cnt;
    typeof(hw->mem_rx_status) rx_status = hw->mem_rx_status;
    uint32_t len = 0;

    // When using DPort to read fifo, fifo_cnt is not credible, we need to calculate the real cnt based on the fifo read and write pointer.
    // When using AHB to read FIFO, we can use fifo_cnt to indicate the data length in fifo.
    if (rx_status.wr_addr > rx_status.rd_addr) {
        len = rx_status.wr_addr - rx_status.rd_addr;
    } else if (rx_status.wr_addr < rx_status.rd_addr) {
        len = (rx_status.wr_addr + 128) - rx_status.rd_addr;
    } else {
        len = fifo_cnt > 0 ? 128 : 0;
    }

    return len;
}

espressif / esp-idf