Closed rtadros125 closed 4 years ago
@rod-chapman FYI, and please comment if you have additional input or better description. Thanks.
Does the driver make sure to use a critical section when polling, so that it can’t poll, be preempted and then erroneously think the UART is ready for output? Is the driver ensuring its polling is volatile and not having the loop optimised away?
Forgot to mention that this is for FreeRTOS. Let's wait for whoever will work on this ticket to investigate. Thanks for providing some starting questions.
This is specifically for the UART that we get when using Chisel_P1 on FireSim on AWS F1. The driver code is in the FreeRTOS repo in Demo/RISC-V_Galois_P1/bsp/uart_sifive.c
@podhrmic - does anyone have a definitive specification or data-sheet for this UART device. For instance - the driver polls TXFIFO in a loop to see if the FIFO is empty (and therefore can be written to). Who says that should work?
FYI - I look at the generated code for the polling loop, and it's OK... the access to the TXFIFO register is declared as volatile, and the compiler the doing the right thing.
Chapter 13 of https://static.dev.sifive.com/FU540-C000-v1.0.pdf
Aha! That's that I need... thanks...
#ifdef __riscv_atomic
...
#else
volatile uint32_t *tx = uart + UART_REG_TXFIFO;
// TODO: change this to something better than a busy wait
while ((int32_t)(*tx) < 0)
;
...
#endif
For a P1 you lack A and so use this branch of the code. I believe that cast is strictly undefined behaviour as it overflows a signed integer value, and thus the loop can be optimised out.
Some useful info:
The generated code at -O0 looks like:
volatile uint32_t tx = uart + UART_REG_TXFIFO;
c007b800: 0092b797 auipc a5,0x92b
c007b804: 79078793 addi a5,a5,1936 # c09a6f90
while ((int32_t)(
Does that look OK to you?
I also believe the documentation is misguided and suggests you do something that would risk duplicating characters. When you have atomics, it suggests using an AMOSWAP (the FreeRTOS driver is using an AMOOR, both seem equivalent) to read the FIFO and attempt to enqueue, with the result of the AMOSWAP including whether the FIFO was busy.
However, AMOs are generally not atomic for memory-mapped devices (though they may be on the FU540, I don't know). Instead they're done as ready-modify-write operations, and there is a window between loading TXFIFO, performing the modification and storing the result back, within which the FIFO could become not-full, so the subsequent store part of the swap actually enqueues to the FIFO, but you see the prior full status, and thus believe you need to retry. I strongly suggest the __riscv_atomic
branch of the driver be removed as a result; I know for a fact that the Bluespec cores will not be able to do this atomically, and am skeptical that the Chisel cores would behave any differently, as it requires plumbing your AMOs all the way through the bus to the devices (but maybe TileLink supports that...).
The generated code at -O0 looks like: volatile uint32_t tx = uart + UART_REG_TXFIFO; c007b800: 0092b797 auipc a5,0x92b c007b804: 79078793 addi a5,a5,1936 # c09a6f90 c007b808: 0007a783 lw a5,0(a5) c007b80c: fef42623 sw a5,-20(s0) // TODO: change this to something better than a busy wait while ((int32_t)(tx) < 0) c007b810: 00000013 nop c007b814: fec42783 lw a5,-20(s0) c007b818: 0007a783 lw a5,0(a5) c007b81c: fe07cce3 bltz a5,c007b814 <uart_putchar+0x28>
Does that look OK to you?
Yes, that looks fine. I just tried compiling that snippet myself, with both GCC and Clang, at both -O0 and -O2, and every combination generated the desired code. Maybe this is instead implementation-defined behaviour and not regarded as the more general signed integer overflow category.
Thanks for checking...
Ok, having said that, the AMOOR.W is different from AMOSWAP.W and in fact safer. There is undocumented behaviour in the implementation used for FireSim (and maybe also the FU540) that if your 32-bit write includes the full bit set the write is quashed. Thus even if the FIFO becomes not full during your read-modify-write it's safe to write back. So I think the __riscv_atomic
case is fine.
Is it possible things are calling uart0_rxchar/uart0_txchar? Those are not subject to the uart_rx_mutex/uart_tx_mutex that currently serve two purposes:
Thus uart0_rxchar/uart0_txchar need to acquire the same mutexes as their buffer counterparts.
@jrtc27 good point. I am checking the "objdump" output on the F1 instance now...
Nope.. uart0_txchar() isn't called anywhere, so that's not it... :-(
Hmm... I tried adjusting the task priorities so that the OTA task had higher priority than the HTTP task - I thought this might alter the interleaving of them and thus the results that we see. Interestingly, this seems to make no difference.
I have made and tested a few more hypotheses this morning, thus:
Next step: I will revert steps 1, 2 and 4 (keeping 3 and 5) and see what happens.
OK... I have reverted ideas 1, 2, and 4 above, and it still works OK. :-) Don't know why. :-(
I'm concerned by the (apparent) lack of flow control on this UART. The SiFive UART spec says there's no hardware flow control, and I can't see anyway of implementing software-based (e.g. XON/XOFF stuff) control, since we have no idea what's going on at the receiving end..
So other than "don't send too much data", what can we do?
@podhrmic @dhand-galois Do you two have any idea what at the "receiving end" of our UART? Is it a piece of the FireSim IP? Do you know if it implements any additional buffering and/or flow control of the data? If there's a buffer, how big is it? Can we ask the FireSim people for help with this?
@rod-chapman There are two parts to the "other side" of the UART. The first is a hardware UARTBridge, defined here: https://github.com/DARPA-SSITH-Demonstrators/firesim/blob/c988de0bbef061763e9cad7a65a81eb577c71bc7/sim/firesim-lib/src/main/scala/bridges/UARTBridge.scala#L78
It converts between messages to/from the host via MMIO to tx/rx to the actual UART. There are two 128 entry FIFOs to buffer Rx/Tx characters. I'd have to dig into more code to really confirm, but I believe both of these FIFOs have proper flow control. The host -> target flow control is obvious - if the 128 character buffer fills up, the FIFO ready
signal is low and the host should stop sending new data. The target -> host direction is less clear. Looking at the Verilog and based on how FireSim works, I expect that the entire SoC (including the UART) would be clock gated if the Tx FIFO filled up.
The second piece is this code that runs on the host and communicates with the UARTBridge: https://github.com/DARPA-SSITH-Demonstrators/firesim/blob/c988de0bbef061763e9cad7a65a81eb577c71bc7/sim/firesim-lib/src/main/cc/bridges/uart.cc#L108
The FireSim github is very active and usually responsive. They may be able to help.
One other thing to note - because we changed the simulated speed of the SoC from 3.2GHz to 100MHz but did not change the baud rate, the rate divisor dropped from 868 to 27. If there is a flow control issue somewhere, perhaps decreasing our baud rate would help? I can generate a new AFI as a test case. As an additional data point, although the timing is quite different, I haven't experienced any loss of content using the SiFive driver on Linux.
Another idea - in that same uart.cc
, there's an option to generate a standard pseudo-terminal device rather than the interactive console.
The comment in uart_sifive.c says that 27 is chosen to get a baud rate of 3686400, but that'a not quite right. The SiFive manual says you divide the clock rate by that + 1, so something like 100,000,000 / 28 which is 3571428.571 - which is a bit off. Could this be a source of problems?
It's ~3.1% off, which is getting to the realms of too great, so maybe. Isn't FireSim also clocking the core slower than the claimed 100 MHz, which would make this even further off? Or will both ends be accounting for this?
@dhand-galois I think having a more sensible baudrate would be worth trying - can you generate something around 115k baud?
I will try a little stress tests and see when UART starts dropping characters
@rod-chapman That's something I noticed as well, but I followed the existing code's lead on the discrepancy. It's possible the "+1" is also accounted for in UARTBridge
, as it's not the actual baud rate that matters - only that both sides of tx/rx implement the counter in the same way. The baud rate is configured by the driver, so you could try setting it to 26 to see if there is any improvement without making any hardware changes.
@jrtc27 The UARTBridge and UART operate at the same (real) clock frequency. The main difference is the core design (containing the UART) can be clock gated separate from the bridges. But judging from my quick read through the bridge code, they've implemented that properly and the UART bridge will pause while the core is stalled.
@podhrmic I can certainly do that. I'll get that building shortly.
@podhrmic @rod-chapman Here's a P1 build with the baud rate set to 115,200: agfi-055d5b5652d76cda4
. It is compatible with the existing swpkg, so you can just open up run_sim.sh
and modify AGFI
. Or if you're using FETT to launch it, update the appropriate json file.
I tried 868
and 867
divisor values and both seemed to be working fine for the simple blinky test. Even if we're wrong on the +1 aspect, the error rate is insignificantly low now.
OK... I will try that if I can work out how. I assume I'll need to change the divisor constant in uart_sifive.c to 868 to match, right?
Yes, that's where I changed it for my test build.
I have moved this issue to the current sprint/milestone and assigned it to those performers who are actively working it, given the above discussion.
I did not get time to try this today...
I am testing it now with the new AGFI and Divisor set to 867...
@rod-chapman did the latest Dylan's changes (hard coded baud rate of 115200) work for you as expected?
Seems like this has been resolved with the new changes, so going to close this issue. If the buffering issue comes up again, please re-open.
If we use printf in a few task, and print smth a bit long. The driver acts weird. We get output truncated and it just dies. Or sometimes it just ignore a big part then continue fine afterwards.