Open davidthings opened 3 years ago
Thanks for the detailed report and letting us know, we will look into.
Working back with the bridge project (WiFi and UART) it looks like errors can occur even without SPIRAM. These may be a different issue.
Receiving
@ 238 153 != 26
@ 239 154 != 27
@ 240 155 != 28
@ 241 156 != 29
@ 242 157 != 30
@ 243 158 != 31
@ 244 159 != 32
@ 245 160 != 33
@ 246 161 != 34
What's interesting about these errors is that the errors are not random. Because the test is to transfer a sequence of bytes, the errors above are telling. Instead of 9 bytes 153 - 151, we get 26 - 34. Was this strip of bytes just not transferred (leaving the old value in the buffer)? For 8 bytes were we looking at some other location?
(This was run at 240MHz, -O2, without SPIRAM)
There are some uart fixes in master I don't know if they have been backported to other branches. Also you may have to adjust fifo full threshold and possibly patch in support for bigger fifo at that speed
The testing was based on the latest Master (from a few days ago) so those should have included the UART fixes you mentioned.
So this seems more like an overrun than a context switching bug to you, @negativekelvin? The hope was to eliminate this possibility by some of the tests that avoid using SPIRAM - but nevertheless push everything pretty hard.
My hope is still that someone very close to this code gets a weird feeling about the situation and realizes that there's something obvious that's happening and can fix it!
Reading https://github.com/espressif/esp-idf/issues/5101 I see the OP maintains that the FIFO still "stutters" periodically even with the fifo_cnt fix advocated in the workarounds document. The OP's description matches my experience to some degree.
@ginkgm the comiit has been published and i see that you now use rd/wr_addr to calculate queue length. this is part of the fix but not all of it. what i observed is when the "stutter" happens, fifo_cnt is updated while rd_addr and wr_addr are not. it's almost as if the read does not propagate to the FIFO circuit and returns old data that was stored in the bus register but gets far enough to update fifo_cnt (hence why i call it stuttering - it returns a byte we had already read before). this is why the most important part of my fix is where we read the byte from the fifo: after the read we check if rd_addr has been updated, meaning the byte that was returned actually comes from the FIFO and not a stale one.
For reference, here's the current ESP-IDF master implementation
/**
* @brief Get the length of readable data in UART rxfifo.
*
* @param hw Beginning address of the peripheral registers.
*
* @return The readable data length in rxfifo.
*/
static inline uint32_t uart_ll_get_rxfifo_len(uart_dev_t *hw)
{
uint32_t fifo_cnt = hw->status.rxfifo_cnt;
typeof(hw->mem_rx_status) rx_status = hw->mem_rx_status;
uint32_t len = 0;
// When using DPort to read fifo, fifo_cnt is not credible, we need to calculate the real cnt based on the fifo read and write pointer.
// When using AHB to read FIFO, we can use fifo_cnt to indicate the data length in fifo.
if (rx_status.wr_addr > rx_status.rd_addr) {
len = rx_status.wr_addr - rx_status.rd_addr;
} else if (rx_status.wr_addr < rx_status.rd_addr) {
len = (rx_status.wr_addr + 128) - rx_status.rd_addr;
} else {
len = fifo_cnt > 0 ? 128 : 0;
}
return len;
}
Environment
git describe --tags
to find it): v4.2-dev-3672-g178b122c1xtensa-esp32-elf-gcc --version
to find it):xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0Problem Description
The application is a bridging project - connecting a TCP-IP socket to UART. The speeds need to be really high, so the project is pretty highly tuned. The test environment is host code (Linux) to open a TCP connection to the ESP board, then UART looped back in hardware. Messages are sent and received, testing the content for errors at each step.
During development I began to notice huge error rates in the messages coming back from the UART. What follows is the result of a multi-day attempt to track the bug down.
When using a UART (baud > 3,000,000) in a multitasking environment, where SPIRAM is also being used in other processes, there are frequent character errors.
It is interesting to note that the problem is not caused by the use of SPIRAM in the UART code, the cause seems to be the interruption of code using SPIRAM. In the code below it seems to be the dummy task manipulating SPIRAM that causes the bad UART characters.
While it might be possible to work around this (by using lower baud rates, for example) it seems that an error like this is particularly nasty and might manifest in other places, causing even harder to find bugs and then other luckless engineers to spiral into days of debugging.
Expected Behavior
I expect mostly error free UART TX / RX. It should be noted that except for the conditions noted here, the UART works flawlessly even at 5,000,000 bps, and even with a rather confrontational multitasking environment. (This is really great, BTW!)
Actual Behavior
If the above conditions hold, (>3Mbps, SPIRAM being used) there are frequent errors.
Steps to reproduce
uart_async_rxtxtasks
exampleTest
#define to some valueA good run just ticks up good messages
A bad test
Note that the errors are very repetitive and mostly amount to a single byte here and there.
Code to reproduce this issue
The gist contains code to reproduce the issue. There's a #define
Test
that lets you try different setups.There are other tests included for completeness. Perhaps they are helpful.
The key messaging code is reproduced below. In an infinite loop, we write a bunch of bytes in an increasing sequence to the UART, pause for a bit, then read them back. If the bytes are out of sequence, they are printed out. Pause and repeat.
The rest of the code sets up other dummy tasks.
https://gist.github.com/davidthings/e06081ce465a0dfbd0df52074b88806f
sdkconfig.txt