Severson-Group / AMDC-Firmware

Embedded system code (C and Verilog) which runs the AMDC Hardware
http://docs.amdc.dev/firmware
BSD 3-Clause "New" or "Revised" License
33 stars 6 forks source link

New AMDS Interface not timing out correctly #393

Closed codecubepi closed 3 months ago

codecubepi commented 6 months ago

When the AMDS comes unplugged, the counters all freeze (namely the timeout counter). I believe this is because the fall on the AMDS data line is what starts the state machines and timeout counter counting. But if there is no data, the line will never fall and the SMs never start, so there is no timeout. This will also probably cause the timing manager to freeze, since the done signal for the AMDS will never be asserted.

Comment from Nathan: We need to make sure the timing manager never freezes, so probably have a user configurable time-out register. If sensors are enabled and every sensor isn't done by X time after trigger, still call the ISR, but indicate this in some error/status register

codecubepi commented 5 months ago

Update:

I made an attempt to fix this issue in commit 0d27cc0.

When unplugging the AMDS from the AMDC during operation, I could not replicate the freezing issue I initially encountered. The Timing Manager still sends out triggers, which means the AMDS driver must be timing out and asserting that it is done. However, there is still some glitchiness in the driver's timeout behaviour.

image

As seen in the image above, the AMDS will return valid data and not timeout, until it is unplugged. After unplugging, reading the debug counters gives some interesting results.

The valid counter freezes, which is expected behaviour. Both timeout counters are going, which is expected behaviour as well. However, the image shows that the data line 0 corrupt counter does not stop counting, while the data line 1 corrupt counter is correctly frozen. (in some tests, it was the data line 1 corrupt counter still running, and the data line 0 corrupt counter correctly frozen).

If a data line's corrupt counter is not frozen, that means that the state machines in adc_uart_rx.v and uart_rx.v are not remaining in the IDLE state and are instead still trying to read bytes, which further implies that somehow there is a falling edge on the data line which is starting the state machines... but the data line is unplugged and no falling edges can be seen on the scope... weird.

For now, since the "good operating condition" is working as expected, I will proceed with merging together my work with Annika's so that we can get a mosty-working pre-release of v1.3 to Anirudh. This issue can be further debugged once Anirudh is up and running. Just don't unplug the AMDS you are trying to use 🙃.

codecubepi commented 5 months ago

Also see:

codecubepi commented 5 months ago

A note on this issue after debugging BP4 with Anirudh... I think two additional things would be good to add:

  1. Only enabled AMDS sensors increase their timeout counters. Currently, all AMDS sensors receive the trigger signal from the timing manager in the FPGA, including those which are not enabled. Because they are not enabled (and obviously not plugged into an AMDS) they will of course timeout, and increment their counters. This is a little silly, if a given AMDS driver is not even enabled, there's not really a reason for it to increase its timeout counter.
  2. I think it would be nice to separate the timeout counter into two separate counters: one for a trigger timeout (if a trigger is received but the AMDS never responds at all, determined in adc_uart_rx.v) and a packet/byte timeout (a timeout midway through receiving data packets, determined in uart_rx.v).
codecubepi commented 3 months ago

@codecubepi

Document here error conditions and how to determine what the error condition is based on what the debug counters are doing

codecubepi commented 3 months ago

A note on this issue after debugging BP4 with Anirudh... I think two additional things would be good to add:

  1. Only enabled AMDS sensors increase their timeout counters. Currently, all AMDS sensors receive the trigger signal from the timing manager in the FPGA, including those which are not enabled. Because they are not enabled (and obviously not plugged into an AMDS) they will of course timeout, and increment their counters. This is a little silly, if a given AMDS driver is not even enabled, there's not really a reason for it to increase its timeout counter.
  2. I think it would be nice to separate the timeout counter into two separate counters: one for a trigger timeout (if a trigger is received but the AMDS never responds at all, determined in adc_uart_rx.v) and a packet/byte timeout (a timeout midway through receiving data packets, determined in uart_rx.v).

Progress made on this today! I have added logic so that only AMDS sensor drivers enabled in the timing manager are truly triggered (and thus timeout, increment their timeout counters, etc).

I renamed the debug counters slightly, so now we have byte valid, corrupt, and timeout counters, and well as a new data timeout counter (timeout from the trigger to the first data packet).

I made some modifications to the logic in amdc_amds_v1_0_S00_AXI.v as well which I believe has solved the issue of the data timeout and byte timeout/corrupt counters ramping simultaneously. I think single-cycle delay added checking signals that had been propogating through flop chains was erroneously kicking off the child state machine in uart_rx.v, which increments the byte counters.

See the following GIF and graph showing how the debug counters respond in real time to the AMDS being repeatedly plugged-in and unplugged (shoutout to @npetersen2 for the Python wizardry). You can also see that the counters do not start until the AMDS interface is enabled in the timing manager.

DebugCounters

image

I'll add a new/update this comment when the resolving code is pushed, and close the issue so we can merge into v1.3-staging before the v1.3 release: #372.