Random invalid data in new AMDS driver?

codecubepi commented 1 month ago

Testing of the new AMDS driver has been going well, save for one problem(?).

When running the timing manager in automatic triggering mode (the default), the AMDS FPGA counters report invalid data (meaning that the parity check failed). The problem seems to get worse with lower and lower user ratios. It does not seems to occur with sufficiently high user ratio or in MANUAL triggering mode.

I can't seem to actually "capture" any invalid data, so I'm not entirely convinced this is a big deal, but it might require further testing.

codecubepi commented 1 month ago

Update:

After logging some signals using the new AMDS driver firmware, I get the following graphs:

This is logging one second of data for two channels. Channel 1 is a 10 Hz sine wave and Channel 2 is a 5 Hz skewed triangle wave. Both channels appear to have flawless data. However, I am still seeing that the "data is invalid" and the corrupt counter go up.

I am running the timing manager and AMDS driver with the following parameters: trigger on both PWM high and low, with a user ratio of 4. This is effectively toggling SYNC_ADC at 50 kHz, or every 20us, or as fast as the sensor can currently be triggered,

I am noticing a few strange things when I check the channel valid register. It's value is always either 0, 15, 240, or 255. These values mean the following:

Value (Decimal)	Value (Binary)	Meaning
0	0000_0000	All channels are invalid
15	0000_1111	Only channels 1-4 are valid
240	1111_0000	Only channels 5-8 are valid
255	1111_1111	All channels are valid

These valid bits are all cleared when a new set of packets comes in, and should be being set one by one, as each of four packets looks valid. But this is pretty suspicious, to only see these statuses for valid. There are a few possible explanations:

It could be that all valid bits are cleared right before the register is read in the C code (unlikely that this is repeatedly the case).
It could be that (for some strange reason), the first packet of the four on both data lines habitually causes an issue that causes that data line's state machine to quit, and then the following three packets on the data line never have a chance to be marked valid
I also have a check at the end of the adc_uart state machine that the packet counter should not be at some random value (outside 0-3). If it is some random value, all the valid bits are cleared in "panic". This check could probably be removed, it's not super necessary and could be causing unnecessary panics. If nothing else, it would be a good thing to remove for a debug sanity test. Also, the width of the packet counter could be reduced from 4 bits to 2 bits so that it will always be 0-3.

Regardless of the reason for this, we still have the issue with the corrupt counter. The corrupt counter is incrementing because the uart_rx module is telling its parent adc_uart module that either the header, data MSB, or data LSB is "corrupt", meaning it failed the parity check. But again, I don't actually see any evidence of truly corrupt data.

codecubepi commented 1 month ago

With help from @npetersen2 the issue has been solved! See commit 8cad1d6.

The issue was in the FPGA cycle delay caused by the hierarchical state machines in the driver. The amdc_amds_yada_AXI.v file double-flops the incoming data lines from the AMDS for metastability, and then detects a falling edge on the flopped data line to start the SM in adc_uart_rx.v, which handles the receving of the four packets (HEADER, MSB, and LSB).

That state machine starts the other state machine in uart_rx.v, which receives the individual UART bytes. After this state machine is started by the parent, it looks for the falling edge of the start bit.

The problem was that the sequential delay of the start signal being flopped through the state machine hierarchy meant that the start bit had sometimes already passed, so the uart_rx state machine was detecting a falling edge of a data bit instead, which caused a parity check fail and so on.

The solution was to flop the data line a few more times in uart_rx.v, to guarantee that the WAIT_FOR_START_BIT state was always entered before receiving the start bit:

https://github.com/Severson-Group/AMDC-Firmware/blob/8cad1d6708cc6569a32f72006c2b0e16b37996fb/ip_repo/amdc_amds_1.0/src/uart_rx.v#L34-L59

Results below:

codecubepi commented 1 month ago

Closing with merge of #394 into the staging branch.

Severson-Group / AMDC-Firmware

Random invalid data in new AMDS driver? #389