espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

uart_flush blocks forever (IDFGH-7778) #9315

Open Superberti opened 2 years ago

Superberti commented 2 years ago

Environment

Problem Description

I'm communicating with another device over a serial RS422 line at a high rate of 1 Mbps with no flow control. Receiving and transmitting work absolutely reliably. I only had to move the UART driver to the second core in order to get no buffer overruns in the hardware FIFO (although the UART software ring buffer is always big enough for the data so this seems to be an interrupt latency problem on core 0). One task listens to the UART input through a queue, if (xQueueReceive(spp_uart_queue, (void *)&event, (portTickType)portMAX_DELAY))..., and another task sends some data back to the UART.

Now the sending task wants to flush the UART (some time after successfully receiving a lot of UART data) with uart_flush(UART_NUM_1); and the function blocks forever. At that moment my second device is not sending any data, so the UART input buffer is not being filled. I don't know how this can happen, as uart_flush should never block! But if I send some data from my second device to the UART after uart_flush has blocked, the function unblocks immediately. After that the program works as expected again.

Fortunately, I do not really need to flush the UART (input) in my sending task, but this behaviour is strange and may lead to serious problems. Maybe the function hangs on one of the spinlocks?

Expected behaviour

The function uart_flush should never block.

Actual behaviour

The function blocks in some scenarios.

Steps to reproduce

Unfortunately I don't have any simple test code right now, and it needs a second device that sends "real" data to the UART. I know this is not very satisfying, but maybe my description helps to find a potential problem in the driver.

moefear85 commented 2 years ago

I hope nobody takes this the wrong way, and I was going to wait before mentioning this, but I share your thoughts, and there is at least one other person who is concerned about the current state of the UART driver.

Specifically, I've seen similar hangs with RS485 as well as USB-CDC, and there is the still unresolved issue of misdetected UART break positions. Moreover, I suspect a UART RX corruption issue under load (by load I mean that Wi-Fi is running). I observe it unmistakably, but I haven't finished isolated test code that reproduces the issue and would let me be sure I haven't made a mistake elsewhere myself (but I doubt it). I've created a test stream from node A to node B, but the problem isn't manifesting itself yet because the system isn't under load the way it is in my actual project.

Specifically, when using the UART event subsystem under load, event.size will sometimes be larger than what uart_get_buffered_data_len() returns. Usually both are 120 under load, but sporadically event.size is 120 while uart_get_buffered_data_len() returns 118. It is exactly at that moment that I detect corruption in the data stream. Without stress they always match, even if less than 120.

One might think it's corruption on the wires. But no framing/parity/full/OVF errors are ever raised, nor any breaks detected. Moreover, if it were line corruption, it should also occur without size mismatches, but it never does. I do a lot of heavy logging of almost identical lines very often. It's something to celebrate whenever I do detect a corrupted byte anywhere in the stream. Either way, I know for sure the wires are very clean and quiet. The corruption is happening in the RX FIFO or afterwards. One might think that if the UART were responsible, it would also show up in the output logs. Not really, since that is the TX direction, not RX. Moreover, I sometimes split/T the UART channels to monitor two ESPs communicating with each other over UART. If the corruption happened on the line, it would have to be detected on the monitor, but it isn't.

I think there is still something wrong with the UART driver and/or with concurrency management. My worry is that one or both might be a silicon issue, meaning I wouldn't be able to use an ESP for any serious project for a long time into the future. Either way, I'm working on tracing the problem and will report it to Espressif once I understand what is going on. I've only just started studying the SoC/HAL structure, the UART driver, and the Xtensa ISA.
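The size mismatch described above could be logged at the point where it happens with something like the following. This is only a diagnostic sketch, not a fix and not taken from the driver: bytes_to_read, read_checked, the "uart_chk" log tag and the 20 ms timeout are all made-up names and values. The ESP-IDF part is guarded by ESP_PLATFORM so that the small helper can also be checked on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: never read more than the event claims, than the
 * driver reports as buffered, or than the destination buffer can hold. */
static size_t bytes_to_read(size_t event_size, size_t buffered, size_t cap)
{
    size_t n = event_size < buffered ? event_size : buffered;
    return n < cap ? n : cap;
}

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "driver/uart.h"
#include "esp_log.h"

/* Hypothetical: call from the UART event task when event->type == UART_DATA.
 * Returns the uart_read_bytes() result, or -1 if the length query failed. */
int read_checked(uart_port_t port, const uart_event_t *event,
                 uint8_t *buf, size_t cap)
{
    size_t buffered = 0;
    if (uart_get_buffered_data_len(port, &buffered) != ESP_OK) {
        return -1;
    }
    if (buffered != event->size) {
        /* Record the mismatch so it can be correlated with corrupted data. */
        ESP_LOGW("uart_chk", "event.size=%u but buffered=%u",
                 (unsigned)event->size, (unsigned)buffered);
    }
    size_t n = bytes_to_read(event->size, buffered, cap);
    /* Short timeout so a wrong length cannot block the task forever. */
    return uart_read_bytes(port, buf, n, pdMS_TO_TICKS(20));
}
#endif
```

Reading only the smaller of the two lengths does not explain the mismatch; it just keeps the read from waiting for bytes that the driver says are not there.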

moefear85 commented 2 years ago

@Superberti

Still, I wonder... are you sure the flush function itself is blocking? Maybe it completes, but afterwards the code blocks in xQueueReceive waiting for new input? When flushing the buffer, I think (from the example) it is also necessary to reset the queue. Otherwise it leads to unexpected results, such as the next pending event still firing although the input buffer is completely empty, so that any UART read operation in the UART_DATA case then hangs while it waits for new data to actually arrive on the UART. To detect all of these cases, it is often best to set timeouts, then check the return values for errors and at least recover from endless blocking.
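The suggestions above (reset the event queue together with the flush, use timeouts, check return values) might look roughly like this. It is a sketch under assumptions: classify_read, flush_rx_and_events, wait_event and the one-second timeout are made-up names and values, and the flush is assumed to run in the task that reads the event queue, as in the ESP-IDF uart_events example. The ESP-IDF part is guarded by ESP_PLATFORM so the helper can be checked on a host.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical classification of a uart_read_bytes() result, which is -1 on
 * error and otherwise the number of bytes actually read. */
typedef enum { READ_ERROR, READ_TIMEOUT, READ_PARTIAL, READ_COMPLETE } read_status_t;

static read_status_t classify_read(int got, size_t wanted)
{
    if (got < 0) {
        return READ_ERROR;
    }
    if (got == 0 && wanted > 0) {
        return READ_TIMEOUT;
    }
    return ((size_t)got < wanted) ? READ_PARTIAL : READ_COMPLETE;
}

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "driver/uart.h"

/* Drop buffered RX data and any queued events that still refer to it.
 * Assumed to be called from the task that reads uart_queue. */
void flush_rx_and_events(uart_port_t port, QueueHandle_t uart_queue)
{
    uart_flush_input(port);
    xQueueReset(uart_queue);
}

/* Wait at most one second for an event instead of using portMAX_DELAY. */
bool wait_event(QueueHandle_t uart_queue, uart_event_t *event)
{
    return xQueueReceive(uart_queue, event, pdMS_TO_TICKS(1000)) == pdTRUE;
}
#endif
```

With a finite timeout in both the queue wait and the read, a stale event can at worst cost one timeout period rather than hang the task.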

Superberti commented 2 years ago

Hi,

it is definitely the flush function. I logged before and after uart_flush, and without any new UART input it hangs in this function.

moefear85 commented 2 years ago

I suspect uart_flush is not callable from a writing task if there is a separate reading task running (since xQueueReceive will be blocked while it is accessing the underlying read buffer or related locks/semaphores), because the writing task would then also attempt to touch parts of the buffer, or the locks/semaphores, that the receiving side is concurrently accessing. For similar reasons, I know that a FreeRTOS queue can only be reset (i.e. flushed) when there is no operation currently blocked on it; otherwise it also hangs. The driver uses ESP additions (ring buffers), but I assume something similar applies.

You could verify this by sending notifications from the writing task to the reading task, so that the reading task itself does the flushing. The reading task would have to use a timeout when calling xQueueReceive, though, so that it can check for notifications.
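The experiment suggested above could be sketched as follows. Everything named here is hypothetical (REQ_FLUSH, flush_requested, request_flush, reader_poll, the 50 ms timeout); only the FreeRTOS and UART driver calls are real. The ESP-IDF part is guarded by ESP_PLATFORM so the bit test can be checked on a host.

```c
#include <stdint.h>
#include <stdbool.h>

#define REQ_FLUSH (1u << 0)  /* hypothetical request bit in the notification value */

static bool flush_requested(uint32_t notify_bits)
{
    return (notify_bits & REQ_FLUSH) != 0;
}

#ifdef ESP_PLATFORM
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "driver/uart.h"

/* Writer side: ask the reader task to flush instead of calling uart_flush(). */
void request_flush(TaskHandle_t reader_task)
{
    xTaskNotify(reader_task, REQ_FLUSH, eSetBits);
}

/* Reader side: one iteration of the reading task's loop. */
void reader_poll(uart_port_t port, QueueHandle_t uart_queue)
{
    uart_event_t event;
    uint32_t bits = 0;

    /* Finite timeout so the notification check below runs periodically. */
    if (xQueueReceive(uart_queue, &event, pdMS_TO_TICKS(50)) == pdTRUE) {
        /* Handle event.type here (UART_DATA, UART_FIFO_OVF, ...). */
    }

    /* Non-blocking check; all notification bits are cleared on exit. */
    if (xTaskNotifyWait(0, UINT32_MAX, &bits, 0) == pdTRUE && flush_requested(bits)) {
        uart_flush_input(port);
        xQueueReset(uart_queue);
    }
}
#endif
```

If the hang disappears once only the reading task flushes, that would support the concurrency theory; if uart_flush_input still blocks there, the cause is likely elsewhere.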

Superberti commented 2 years ago

Maybe that's the case. But unfortunately it is not documented anywhere. And it's strange that uart_flush does not block every time (called from a different task) but only in some scenarios. In my case it's ok not to call the flush function at all from the non-listening task. It is a bit of a pity that the xTaskNotifyWait(Indexed) functions are limited to one notification per task, so all the XYZIndexed functions are useless (the limit is hardcoded in the FreeRTOS-Header).

Bye, Oliver

negativekelvin commented 2 years ago

I only had to move the UART driver to the second core in order to get no buffer overruns in the hardware FIFO (although the UART software ring buffer is always big enough for the data so this seems to be an interrupt latency problem on core 0).

Do you have the UART ISR placed in IRAM via menuconfig?

Superberti commented 2 years ago

Ah, good hint, it was not checked! The help tells me: "If this option is not selected, UART interrupt will be disabled for a long time and may cause data lost when doing spi flash operation."

Bye, Oliver
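For reference, the menuconfig option discussed here should correspond to the following sdkconfig line. This assumes the symbol name is the same in the ESP-IDF version in use, so it is worth checking against the help text quoted above.

```
CONFIG_UART_ISR_IN_IRAM=y
```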