RADIANTs are locking up and leave UART hanging

ryankrebs016 commented 3 months ago

The RADIANTs deployed in the field are individually locking up every few weeks, and each lockup needs a full power cycle to recover. These screenshots from Felix show the UART hanging during a series of UART reads to the FPGA; on subsequent runs the FPGA can no longer be identified. The board manager seems to be working okay.

[Screenshots: image (15), image (16)]

barawn commented 3 months ago

If the board manager's working fine, it should be able to toggle FPGA_PROGRAM_B, which is the functional equivalent of a full power cycle on the FPGA anyway. It should be able to confirm that by watching the FPGA_DONE pin toggle afterwards (although I think the FPGA will be in bootloader mode then? Not sure).
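A minimal sketch of that reset-and-check sequence, written as a host-side Python simulation rather than board manager firmware. `FakeFpgaPins`, `reprogram_fpga`, and all of the timing values are hypothetical names and numbers made up for illustration; only the idea (pulse FPGA_PROGRAM_B low, then expect FPGA_DONE to drop and come back) is from the comment above.

```python
import time


class FakeFpgaPins:
    """Hypothetical stand-in for the board manager's view of the two config pins."""

    def __init__(self, config_time_s=0.01, responds=True):
        self._program_b = 1
        self._config_time_s = config_time_s   # made-up configuration time
        self._responds = responds             # False models an FPGA that never reconfigures
        self._ready_at = time.monotonic()     # start out already configured

    def set_program_b(self, level):
        if level == 0:
            self._program_b = 0               # holding PROGRAM_B low clears the configuration
        elif self._program_b == 0:
            self._program_b = 1               # release: configuration starts over
            self._ready_at = time.monotonic() + self._config_time_s

    def read_done(self):
        if self._program_b == 0 or not self._responds:
            return 0
        return int(time.monotonic() >= self._ready_at)


def reprogram_fpga(pins, pulse_s=0.001, timeout_s=1.0, poll_s=0.001):
    """Pulse PROGRAM_B low and report whether DONE went low and then high again."""
    pins.set_program_b(0)
    time.sleep(pulse_s)
    saw_done_low = pins.read_done() == 0      # DONE should drop while PROGRAM_B is held low
    pins.set_program_b(1)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if pins.read_done() == 1:
            return saw_done_low               # only a low-then-high toggle counts
        time.sleep(poll_s)
    return False                              # DONE never came back: reprogramming failed
```

A `False` return here would be the "something is really weird" case; a `True` return followed by a still-silent UART would point away from the FPGA configuration itself.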

Assuming the FPGA actually does reprogram (and if it doesn't, something is really weird) and still won't talk to the board manager, it's possible that this is something like a baud rate mismatch between the two that's large enough to break the link (or who knows, some other weird issue in the board manager).
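For a rough sense of how much mismatch is "enough": in an idealized 8N1 frame the receiver resynchronizes on the start edge and samples each bit at its own mid-bit time, so the error accumulates until the stop bit, 9.5 bit periods later. The sketch below works that out. It ignores oversampling quantization and edge jitter, and the baud values in the usage note are illustrative assumptions, not the RADIANT's actual settings.

```python
def sample_offset_bits(tx_baud, rx_baud, bit_index):
    """Offset of the receiver's sample point from the centre of transmitted bit
    `bit_index` (0 = start bit), in units of one transmitted bit period."""
    rx_sample_time = (bit_index + 0.5) / rx_baud   # receiver samples mid-bit on its own clock
    tx_bit_centre = (bit_index + 0.5) / tx_baud    # where the centre of that bit really is
    return (rx_sample_time - tx_bit_centre) * tx_baud


def frame_ok(tx_baud, rx_baud, bits_per_frame=10):
    """True if every bit of the frame (start + 8 data + stop for 8N1) is sampled
    less than half a bit period away from its centre."""
    return all(
        abs(sample_offset_bits(tx_baud, rx_baud, i)) < 0.5
        for i in range(bits_per_frame)
    )
```

With these assumptions, `frame_ok(115200, 115200 * 1.02)` is `True` and `frame_ok(115200, 115200 * 1.08)` is `False`; the ideal limit is 0.5 / 9.5, a bit over 5% total mismatch, and real links need more margin than that.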

Note that if the FPGA does come up in bootloader mode it's possible to check if it's responding by just ignoring the board manager and checking to see if the SPI flash pins respond properly: the bootloader firmware connects the SPI flash pins unconditionally, so the board manager UART doesn't matter. That'd point straight to an issue on the board manager side, obviously.

(It also probably would make sense to add a JTAG user status register as well that can be polled since that's independent of the UART anyway)

cozzyd commented 3 months ago

So adding a bit here:

I have implemented a COBS reset in the BM firmware, but doing that doesn't help, suggesting that the problem isn't something like a dropped byte.
I stupidly haven't checked if the trigger GPIO/SPI stuff is still working after the UART locks up. I will try that next time I see a lockup. If that stuff is still working, then that makes it more likely that it's an issue in the BM interface state machine. This should actually be obvious from the acquisition logs (events coming in but attempts at servoing failing), but they are not kept for too long.
I am close to having a server implementing the XVC protocol, though it would obviously need to be smoke-tested in the lab first, and we should also try to reproduce the lockup there.
I don't know if a UART rate mismatch between the BM and FPGA is that likely, given that power cycling always fixes the problem, which I wouldn't expect if the rates were mismatched. We have seen evidence of a UART rate mismatch between the SBC and BM though.
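For reference, a small sketch of why a dropped byte is normally recoverable with COBS framing: zero never appears inside an encoded frame, so the next zero delimiter restores framing no matter what was lost before it. This is a generic COBS encoder/decoder in Python, not the BM firmware's implementation, and I'm assuming the "COBS reset" relies on this delimiter property; `split_frames` is a hypothetical helper.

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode `data` so that the result contains no zero bytes."""
    out = bytearray()
    block = bytearray()
    for b in data:
        if b == 0:
            out.append(len(block) + 1)   # code byte: distance to the next zero
            out += block
            block.clear()
        else:
            block.append(b)
            if len(block) == 254:        # maximum run: code 0xFF, no implied zero
                out.append(0xFF)
                out += block
                block.clear()
    out.append(len(block) + 1)
    out += block
    return bytes(out)


def cobs_decode(enc: bytes) -> bytes:
    """Decode one COBS frame (without its zero delimiter)."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        if code == 0:
            raise ValueError("zero byte inside COBS frame")
        block = enc[i + 1 : i + code]
        if len(block) != code - 1:
            raise ValueError("truncated COBS frame")
        out += block
        i += code
        if code != 0xFF and i < len(enc):
            out.append(0)                # the code byte stood in for a zero
    return bytes(out)


def split_frames(stream: bytes):
    """Yield decoded payloads from a zero-delimited stream; None marks a bad frame."""
    for raw in stream.split(b"\x00"):
        if not raw:
            continue
        try:
            yield cobs_decode(raw)
        except ValueError:
            yield None


frames = [b"\x01\x02\x03\x04", b"\x10\x00\x20"]
stream = b"".join(cobs_encode(f) + b"\x00" for f in frames)
damaged = stream[:2] + stream[3:]        # drop one byte from the first frame
recovered = list(split_frames(damaged))  # first frame is lost, second still decodes
```

Since a reset along these lines doesn't clear the lockup, a simple framing slip looks unlikely, which matches the conclusion above.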

barawn commented 3 months ago

It probably makes sense to implement a "read JTAG register" function in the board manager, since you could start off by reading innocuous stuff like the JTAG ID, and then we could add status/control via the USER registers. Debugging's going to be borderline impossible if it's week-scale, so the smarter thing is probably to just figure out the quickest reset method.
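As a sketch of what the "innocuous" first check could look like on the host side once a raw 32-bit IDCODE is available: the field layout below is the standard IEEE 1149.1 one (4-bit version, 16-bit part number, 11-bit manufacturer ID, and a least significant bit that is always 1). The function names are hypothetical, and the example value in the usage note is only illustrative; the RADIANT's actual IDCODE isn't stated in this thread.

```python
def decode_idcode(idcode: int) -> dict:
    """Split a 32-bit JTAG IDCODE into its IEEE 1149.1 fields."""
    return {
        "version": (idcode >> 28) & 0xF,
        "part_number": (idcode >> 12) & 0xFFFF,
        "manufacturer": (idcode >> 1) & 0x7FF,
        "marker_bit": idcode & 0x1,       # must be 1 for a valid IDCODE
    }


def idcode_looks_alive(idcode: int) -> bool:
    """Cheap health check: a stuck-low or stuck-high TDO reads as all 0s or all 1s."""
    if idcode in (0x00000000, 0xFFFFFFFF):
        return False
    return (idcode & 0x1) == 1
```

For example, `decode_idcode(0x0362D093)` gives manufacturer `0x049` and part number `0x362D`, while `idcode_looks_alive(0xFFFFFFFF)` is `False`. The same read path could later carry the status/control USER registers.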

I agree that I can't figure out how a baud rate mismatch would work either, but I can't figure out how a week-scale failure would occur anyway, unless you're actually talking about power or clock glitches and such.

What's the build that's installed on the RADIANTs in the field? Is it mine (the v0r3p3 build) or was it one that's not up here?

"more likely that it's an issue in the BM interface state machine. "

You could conceivably get to a point where everything's locked up and a COBS reset won't work because the UART has a FIFO internally, so if no one's reading (because the COBS output FIFO is full) the UART won't present the data the reset detection uses.

You could detect that by looking for a UART RX FIFO overflow, but the problem there is that there's no easy way to communicate that off-board, which is probably why it's better to just add a JTAG control register to forcibly reset things so that you know something went wrong.
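A toy model of that failure mode, to make the ordering problem concrete. This is a Python simulation with made-up FIFO depths and a hypothetical reset byte, not the actual firmware: the reset detector only sees bytes that are popped from the UART RX FIFO, and nothing is popped while the downstream output FIFO is full.

```python
from collections import deque

RESET_BYTE = 0x00  # hypothetical: the byte the reset detection looks for


class UartRxPath:
    """UART RX FIFO -> reset detector -> output FIFO, with nobody draining the output."""

    def __init__(self, rx_depth=4, out_depth=2):
        self.rx_fifo = deque()
        self.out_fifo = deque()
        self.rx_depth = rx_depth
        self.out_depth = out_depth
        self.rx_overflow = False   # sticky flag, the thing one could poll over JTAG
        self.resets_seen = 0

    def receive(self, byte):
        """A byte arrives on the wire; it is dropped if the RX FIFO is full."""
        if len(self.rx_fifo) >= self.rx_depth:
            self.rx_overflow = True
            return
        self.rx_fifo.append(byte)

    def step(self):
        """Downstream logic: pop one byte, but only if the output FIFO has room."""
        if not self.rx_fifo or len(self.out_fifo) >= self.out_depth:
            return
        byte = self.rx_fifo.popleft()
        if byte == RESET_BYTE:
            self.resets_seen += 1
            self.out_fifo.clear()
        else:
            self.out_fifo.append(byte)


def run(path, data):
    for byte in data:
        path.receive(byte)
        path.step()
    return path


healthy = run(UartRxPath(), [RESET_BYTE])                 # reset is seen
stalled = run(UartRxPath(), [1, 2, 3, 4, 5, 6, RESET_BYTE])  # reset never reaches the detector
```

In the stalled case the output FIFO fills after two bytes, the RX FIFO fills behind it, and the reset byte is dropped on overflow, so `resets_seen` stays 0 while `rx_overflow` is set. That flag is exactly the kind of state a JTAG status register could expose.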

cozzyd commented 3 months ago

The RADIANTs are currently all running 0.6.0, but this problem predates that version (though it was maybe less common before?).

Arguing against it being the BM interface is that it feels (without hard data backing it up) more likely to lock up at high data rates. The UART link only carries slow control data during running (reading scalers / setting thresholds), so high data rates shouldn't really affect it much unless something is really funky. High data rates would conceivably fill up the metadata FIFO though, and maybe that can cause problems other than mismatches?

Adding some basic JTAG functionality in the BM is probably better than remote arbitrary control. I can start working on that (unless @ryankrebs016 or @fschlueter wants to...).

barawn commented 3 months ago

High data rates could be power too, I guess.

0.6.0 is one of my builds though, I just never set it as current. Fixing that now.

barawn commented 3 months ago

Also I don't think the high data rates can interact with the board manager. The only thing they really share is the arbitrated WISHBONE bus, but both the SPI output (DMA) and the board manager interface are "slow" in that they can't issue back-to-back WISHBONE transfers, and when they issue a WISHBONE transaction the only thing they wait on is the bus itself.

The way I designed the SPI DMA core is that it never waits on both of its interfaces (the output SPI FIFO and the WISHBONE bus) at the same time, so it can't deadlock that way - it completes the WB transaction and then tries to write the result (which it's got stored) into the FIFO - so if the FIFO fills, it just sits there waiting with the WB bus free (and likewise for the reverse it sits there waiting for data from SPI, and then issues the transaction after it receives it).
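A toy Python model of the read direction described above, just to show the "never wait on both interfaces" property. It is not the HDL; the class, state names, and the single-step bus transaction are simplifying assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    WB_READ = auto()      # waiting on the WISHBONE bus only
    FIFO_WRITE = auto()   # waiting on the SPI output FIFO only, bus is free


class SpiDmaModel:
    def __init__(self, fifo_depth):
        self.state = State.WB_READ
        self.fifo = []
        self.fifo_depth = fifo_depth
        self.latched = None   # result of the last completed bus transaction
        self.next_addr = 0

    @property
    def wants_bus(self):
        """The core only ever requests the bus while in WB_READ."""
        return self.state is State.WB_READ

    def step(self, bus_granted, memory):
        if self.state is State.WB_READ:
            if bus_granted:
                # The transaction completes on its own; the FIFO is not consulted here.
                self.latched = memory[self.next_addr]
                self.next_addr += 1
                self.state = State.FIFO_WRITE
        else:
            if len(self.fifo) < self.fifo_depth:
                self.fifo.append(self.latched)
                self.latched = None
                self.state = State.WB_READ
            # FIFO full: stay here, holding the data but not the bus.
        return self.state


memory = [10, 20, 30]
dma = SpiDmaModel(fifo_depth=1)
for _ in range(6):                      # nobody drains the FIFO
    dma.step(bus_granted=True, memory=memory)
blocked_wants_bus = dma.wants_bus       # False: stuck on the FIFO, bus left free
dma.fifo.pop(0)                         # SPI side finally reads a word
dma.step(bus_granted=True, memory=memory)
```

With the FIFO full the model parks in `FIFO_WRITE` without requesting the bus, so another master such as the board manager interface is never blocked by it; once the FIFO drains, the stored word goes out and the core returns to `WB_READ`.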

@cozzyd might want to actually try to implement the whole powergood interrupt stuff in the board manager. I might be able to set alarms in the FPGA as well or something, not sure (but again, the whole 'how to tell you' issue). Seems unlikely since the 1V0/1V8 rails are monstrously overspec'd, but who knows.

cozzyd commented 2 months ago

Now it's pretty clear that it's not just the UART that freezes. From inspecting log output just after it stalls, no data is read in either, suggesting at least the GPIO "trigger ready" is not working (and indeed, the state of the GPIO is low after stopping the DAQ).

RNO-G / firmware-radiant

RADIANTs are locking up and leave UART hanging #18