GaloisInc / BESSPIN-CloudGFE

The AWS cloud deployment of the BESSPIN GFE platform.
Apache License 2.0
2 stars 2 forks source link

Firesim FreeRTOS UART driver does not seem to handle buffering properly #101

Closed rtadros125 closed 4 years ago

rtadros125 commented 4 years ago

If we use printf in a few task, and print smth a bit long. The driver acts weird. We get output truncated and it just dies. Or sometimes it just ignore a big part then continue fine afterwards.

rtadros125 commented 4 years ago

@rod-chapman FYI, and please comment if you have additional input or better description. Thanks.

jrtc27 commented 4 years ago

Does the driver make sure to use a critical section when polling, so that it can’t poll, be preempted and then erroneously think the UART is ready for output? Is the driver ensuring its polling is volatile and not having the loop optimised away?

rtadros125 commented 4 years ago

Forgot to mention that this is for FreeRTOS. Let's wait for whoever will work on this ticket to investigate. Thanks for providing some starting questions.

rod-chapman commented 4 years ago

This is specifically for the UART that we get when using Chisel_P1 on FireSim on AWS F1. The driver code is in the FreeRTOS repo in Demo/RISC-V_Galois_P1/bsp/uart_sifive.c

rod-chapman commented 4 years ago

@podhrmic - does anyone have a definitive specification or data-sheet for this UART device. For instance - the driver polls TXFIFO in a loop to see if the FIFO is empty (and therefore can be written to). Who says that should work?

rod-chapman commented 4 years ago

FYI - I look at the generated code for the polling loop, and it's OK... the access to the TXFIFO register is declared as volatile, and the compiler the doing the right thing.

jrtc27 commented 4 years ago

Chapter 13 of https://static.dev.sifive.com/FU540-C000-v1.0.pdf

rod-chapman commented 4 years ago

Aha! That's that I need... thanks...

jrtc27 commented 4 years ago
#ifdef __riscv_atomic
    ...
#else
    volatile uint32_t *tx = uart + UART_REG_TXFIFO;
    //  TODO: change this to something better than a busy wait
    while ((int32_t)(*tx) < 0)
        ;
    ...
#endif

For a P1 you lack A and so use this branch of the code. I believe that cast is strictly undefined behaviour as it overflows a signed integer value, and thus the loop can be optimised out.

rod-chapman commented 4 years ago

Some useful info:

  1. The TX FIFO is 8 "entries" deep...
  2. There is no hardware flow control
rod-chapman commented 4 years ago

The generated code at -O0 looks like: volatile uint32_t tx = uart + UART_REG_TXFIFO;
c007b800: 0092b797 auipc a5,0x92b c007b804: 79078793 addi a5,a5,1936 # c09a6f90 c007b808: 0007a783 lw a5,0(a5) c007b80c: fef42623 sw a5,-20(s0) // TODO: change this to something better than a busy wait
while ((int32_t)(
tx) < 0) c007b810: 00000013 nop c007b814: fec42783 lw a5,-20(s0) c007b818: 0007a783 lw a5,0(a5) c007b81c: fe07cce3 bltz a5,c007b814 <uart_putchar+0x28>

Does that look OK to you?

jrtc27 commented 4 years ago

I also believe the documentation is misguided and suggests you do something that would risk duplicating characters. When you have atomics, it suggests using an AMOSWAP (the FreeRTOS driver is using an AMOOR, both seem equivalent) to read the FIFO and attempt to enqueue, with the result of the AMOSWAP including whether the FIFO was busy.

However, AMOs are generally not atomic for memory-mapped devices (though they may be on the FU540, I don't know). Instead they're done as ready-modify-write operations, and there is a window between loading TXFIFO, performing the modification and storing the result back, within which the FIFO could become not-full, so the subsequent store part of the swap actually enqueues to the FIFO, but you see the prior full status, and thus believe you need to retry. I strongly suggest the __riscv_atomic branch of the driver be removed as a result; I know for a fact that the Bluespec cores will not be able to do this atomically, and am skeptical that the Chisel cores would behave any differently, as it requires plumbing your AMOs all the way through the bus to the devices (but maybe TileLink supports that...).

jrtc27 commented 4 years ago

The generated code at -O0 looks like: volatile uint32_t tx = uart + UART_REG_TXFIFO; c007b800: 0092b797 auipc a5,0x92b c007b804: 79078793 addi a5,a5,1936 # c09a6f90 c007b808: 0007a783 lw a5,0(a5) c007b80c: fef42623 sw a5,-20(s0) // TODO: change this to something better than a busy wait while ((int32_t)(tx) < 0) c007b810: 00000013 nop c007b814: fec42783 lw a5,-20(s0) c007b818: 0007a783 lw a5,0(a5) c007b81c: fe07cce3 bltz a5,c007b814 <uart_putchar+0x28>

Does that look OK to you?

Yes, that looks fine. I just tried compiling that snippet myself, with both GCC and Clang, at both -O0 and -O2, and every combination generated the desired code. Maybe this is instead implementation-defined behaviour and not regarded as the more general signed integer overflow category.

rod-chapman commented 4 years ago

Thanks for checking...

jrtc27 commented 4 years ago

Ok, having said that, the AMOOR.W is different from AMOSWAP.W and in fact safer. There is undocumented behaviour in the implementation used for FireSim (and maybe also the FU540) that if your 32-bit write includes the full bit set the write is quashed. Thus even if the FIFO becomes not full during your read-modify-write it's safe to write back. So I think the __riscv_atomic case is fine.

jrtc27 commented 4 years ago

Is it possible things are calling uart0_rxchar/uart0_txchar? Those are not subject to the uart_rx_mutex/uart_tx_mutex that currently serve two purposes:

  1. Not interleaving text for multiple uart0_rxbuffer/uart0_txbuffer calls
  2. Ensuring preemption in the non-atomics case between polling and enqueuing doesn't cause issues (does not apply to receiving, since dequeuing-and-testing is an atomic operation with no need for hardware atomics)

Thus uart0_rxchar/uart0_txchar need to acquire the same mutexes as their buffer counterparts.

rod-chapman commented 4 years ago

@jrtc27 good point. I am checking the "objdump" output on the F1 instance now...

rod-chapman commented 4 years ago

Nope.. uart0_txchar() isn't called anywhere, so that's not it... :-(

rod-chapman commented 4 years ago

Hmm... I tried adjusting the task priorities so that the OTA task had higher priority than the HTTP task - I thought this might alter the interleaving of them and thus the results that we see. Interestingly, this seems to make no difference.

rod-chapman commented 4 years ago

I have made and tested a few more hypotheses this morning, thus:

  1. Flush the UART TxFIFO just before termination to make sure it's "empty". Result: no improvement.
  2. Turn off the Receiving UART (the uart0_init() code enables the Rx side ... no idea why, but seems unnecessary). Result: no improvement.
  3. Force the OTA task to take the ff mutex in Write_Payload_To_Log. Should not be necessary, and something of a sledgehammer to serialize the TTY Log so that the payload dump does not get interleaved with output from the HTTP task. Result: TTY output is now properly formatted, but TTY Log still truncated and test run fails.
  4. Limit to OTA Payload output to the TTY Log to 128 characters, not 256. Result: still fails, but fails later than before (suggesting that reducing the volume of the log generated by the OTA task is in some way relevant to what's going on.)
  5. Remove "Informational" log messages in ota.c and tftp_server.c - just reducing the volume again. This time the run completes OK (with steps 1, 2, 3 and 4 also in place.)

Next step: I will revert steps 1, 2 and 4 (keeping 3 and 5) and see what happens.

rod-chapman commented 4 years ago

OK... I have reverted ideas 1, 2, and 4 above, and it still works OK. :-) Don't know why. :-(

rod-chapman commented 4 years ago

I'm concerned by the (apparent) lack of flow control on this UART. The SiFive UART spec says there's no hardware flow control, and I can't see anyway of implementing software-based (e.g. XON/XOFF stuff) control, since we have no idea what's going on at the receiving end..

So other than "don't send too much data", what can we do?

rod-chapman commented 4 years ago

@podhrmic @dhand-galois Do you two have any idea what at the "receiving end" of our UART? Is it a piece of the FireSim IP? Do you know if it implements any additional buffering and/or flow control of the data? If there's a buffer, how big is it? Can we ask the FireSim people for help with this?

dhand-galois commented 4 years ago

@rod-chapman There are two parts to the "other side" of the UART. The first is a hardware UARTBridge, defined here: https://github.com/DARPA-SSITH-Demonstrators/firesim/blob/c988de0bbef061763e9cad7a65a81eb577c71bc7/sim/firesim-lib/src/main/scala/bridges/UARTBridge.scala#L78

It converts between messages to/from the host via MMIO to tx/rx to the actual UART. There are two 128 entry FIFOs to buffer Rx/Tx characters. I'd have to dig into more code to really confirm, but I believe both of these FIFOs have proper flow control. The host -> target flow control is obvious - if the 128 character buffer fills up, the FIFO ready signal is low and the host should stop sending new data. The target -> host direction is less clear. Looking at the Verilog and based on how FireSim works, I expect that the entire SoC (including the UART) would be clock gated if the Tx FIFO filled up.

The second piece is this code that runs on the host and communicates with the UARTBridge: https://github.com/DARPA-SSITH-Demonstrators/firesim/blob/c988de0bbef061763e9cad7a65a81eb577c71bc7/sim/firesim-lib/src/main/cc/bridges/uart.cc#L108

The FireSim github is very active and usually responsive. They may be able to help.

One other thing to note - because we changed the simulated speed of the SoC from 3.2GHz to 100MHz but did not change the baud rate, the rate divisor dropped from 868 to 27. If there is a flow control issue somewhere, perhaps decreasing our baud rate would help? I can generate a new AFI as a test case. As an additional data point, although the timing is quite different, I haven't experienced any loss of content using the SiFive driver on Linux.

dhand-galois commented 4 years ago

Another idea - in that same uart.cc, there's an option to generate a standard pseudo-terminal device rather than the interactive console.

rod-chapman commented 4 years ago

The comment in uart_sifive.c says that 27 is chosen to get a baud rate of 3686400, but that'a not quite right. The SiFive manual says you divide the clock rate by that + 1, so something like 100,000,000 / 28 which is 3571428.571 - which is a bit off. Could this be a source of problems?

jrtc27 commented 4 years ago

It's ~3.1% off, which is getting to the realms of too great, so maybe. Isn't FireSim also clocking the core slower than the claimed 100 MHz, which would make this even further off? Or will both ends be accounting for this?

podhrmic commented 4 years ago

@dhand-galois I think having a more sensible baudrate would be worth trying - can you generate something around 115k baud?

podhrmic commented 4 years ago

I will try a little stress tests and see when UART starts dropping characters

dhand-galois commented 4 years ago

@rod-chapman That's something I noticed as well, but I followed the existing code's lead on the discrepancy. It's possible the "+1" is also accounted for in UARTBridge, as it's not the actual baud rate that matters - only that both sides of tx/rx implement the counter in the same way. The baud rate is configured by the driver, so you could try setting it to 26 to see if there is any improvement without making any hardware changes.

@jrtc27 The UARTBridge and UART operate at the same (real) clock frequency. The main difference is the core design (containing the UART) can be clock gated separate from the bridges. But judging from my quick read through the bridge code, they've implemented that properly and the UART bridge will pause while the core is stalled.

@podhrmic I can certainly do that. I'll get that building shortly.

dhand-galois commented 4 years ago

@podhrmic @rod-chapman Here's a P1 build with the baud rate set to 115,200: agfi-055d5b5652d76cda4. It is compatible with the existing swpkg, so you can just open up run_sim.sh and modify AGFI. Or if you're using FETT to launch it, update the appropriate json file.

I tried 868 and 867 divisor values and both seemed to be working fine for the simple blinky test. Even if we're wrong on the +1 aspect, the error rate is insignificantly low now.

rod-chapman commented 4 years ago

OK... I will try that if I can work out how. I assume I'll need to change the divisor constant in uart_sifive.c to 868 to match, right?

dhand-galois commented 4 years ago

Yes, that's where I changed it for my test build.

kiniry commented 4 years ago

I have moved this issue to the current sprint/milestone and assigned it to those performers who are actively working it, given the above discussion.

rod-chapman commented 4 years ago

I did not get time to try this today...

rod-chapman commented 4 years ago

I am testing it now with the new AGFI and Divisor set to 867...

podhrmic commented 4 years ago

@rod-chapman did the latest Dylan's changes (hard coded baud rate of 115200) work for you as expected?

dhand-galois commented 4 years ago

Seems like this has been resolved with the new changes, so going to close this issue. If the buffering issue comes up again, please re-open.