Severson-Group / AMDC-Firmware

Embedded system code (C and Verilog) which runs the AMDC Hardware
http://docs.amdc.dev/firmware
BSD 3-Clause "New" or "Revised" License

SensorCard samples seem to miss packets every so often #164

Closed npetersen2 closed 3 years ago

npetersen2 commented 3 years ago

When testing the new SensorCard data interface, mostly, the data interface works great. However, every once in a while, it seems that samples are dropped somewhere.

This issue serves as a log of me trying to figure out where.

This is not the same issue as #163! However, in #163, you can see this problem clearly, as shown below (image copied from the other issue) in the 1st difference of the sampled data from a function generator:

image

npetersen2 commented 3 years ago

UART bytes over wire

Looking at raw UART data over wire... Are packets missed there??

Here, I attached a logic analyzer to the raw UART signals between the AMDC and the motherboard.

Here is a screenshot capturing ~10 seconds of data:

image

...Lots of data!

If we zoom in, we can see that data is sent over the UART OUT lines every time the UART IN2 line has an edge (the firmware design I chose). The AMDC toggles this line every 100us, i.e., on each 10kHz control update it asks for new data. This is working fine; there are no gaps where it fails to ask for data!

image

If we zoom in more into one single UART packet, we can see the decoded bytes. I wrote the code to use two wires to send 8 data packets (for 8 ADC channels). The header bytes 0x90 to 0x93 encode which ADC channel is coming in, 0..3. The next two bytes are the ADC data, 16 bit ADCs. This is working great. We can see the UART is indeed running at 25Mbaud. Fast!

image

NOW, we can get the logic analyzer program to give us a giant list of all the data bytes it found:

image

Analyzing UART bytes over wire

Pulling this into Python, we can parse these bytes offline and recreate the resulting data stream. FYI, the FPGA is doing this parsing in real-time; we are just recreating it here in Python for debugging (i.e. to see if any data is dropped within the motherboard firmware itself).
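A minimal sketch of this kind of offline parsing, not the actual notebook code. It assumes the logic analyzer export has been reduced to a list of (timestamp, byte) pairs for one UART wire, that the two data bytes arrive high byte first as a signed 16-bit code, and that the ADC range is +/-10 V. None of those details are stated in this issue, and `parse_stream` and `code_to_volts` are hypothetical names.

```python
HEADER_BASE = 0x90        # headers 0x90..0x93 -> ADC channel 0..3 (from the issue)
NUM_CHANNELS = 4          # channels carried on one UART wire
FULL_SCALE_VOLTS = 10.0   # ASSUMPTION: +/-10 V ADC range, not stated in the issue


def code_to_volts(code):
    """Convert a signed 16-bit ADC code to volts (assumed linear scaling)."""
    return code * FULL_SCALE_VOLTS / 32768.0


def parse_stream(records):
    """Parse (timestamp_s, byte) records into (timestamp_s, channel, volts).

    ASSUMPTION: each packet is a header byte, then the high data byte,
    then the low data byte. Bytes that are not a valid header are skipped
    so the parser can resynchronize on the next header.
    """
    samples = []
    i = 0
    while i + 2 < len(records):
        t, header = records[i]
        channel = header - HEADER_BASE
        if 0 <= channel < NUM_CHANNELS:
            hi = records[i + 1][1]
            lo = records[i + 2][1]
            code = int.from_bytes(bytes([hi, lo]), "big", signed=True)
            samples.append((t, channel, code_to_volts(code)))
            i += 3
        else:
            i += 1  # not a header: skip one byte and try again
    return samples


# Two hand-built packets: channel 0 with code 0x4000, channel 1 with code 0xC000.
records = [
    (0.0, 0x90), (0.0, 0x40), (0.0, 0x00),
    (0.0, 0x91), (0.0, 0xC0), (0.0, 0x00),
]
samples = parse_stream(records)
```

The resulting list of tuples can be handed to `pandas.DataFrame` and filtered by channel to get a per-channel table of timestamps and voltages.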

After some python magic, we get a pandas dataframe of all 98k samples from a single channel over UART, ADC channel 0.

This shows the timestamp at which the data appears in the logic analyzer and the voltage the bytes encode:

image

You can see this is working great: the timestamps are 100us apart, like we expect.
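The 100us spacing can also be checked programmatically instead of by eye. This is a sketch only: `find_timing_gaps` is a hypothetical helper, and the 25% tolerance is an arbitrary assumption.

```python
def find_timing_gaps(timestamps, period=100e-6, tolerance=0.25):
    """Return (index, delta) for each consecutive pair of timestamps whose
    spacing differs from `period` by more than `tolerance` * period."""
    gaps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if abs(delta - period) > tolerance * period:
            gaps.append((i, delta))
    return gaps


# Synthetic timestamps: 10 kHz sampling, then the same list with one sample removed.
ideal = [n * 100e-6 for n in range(10)]
with_drop = ideal[:5] + ideal[6:]

gaps_ideal = find_timing_gaps(ideal)
gaps_dropped = find_timing_gaps(with_drop)
```

An empty result means every sample arrived one period after the previous one; a missing packet on the wire would show up as a delta of roughly two periods.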

Plotting all 98k samples gives us a big blob:

image

Plotting a small window (0 to 10ms) proves this is indeed the 16V pk-pk (+8V to -8V) signal:

image

Proving the motherboard is not the issue

Now, we plot the delta of the raw data from above. This should show the same kind of crap we saw at the beginning of this post if the raw data itself is missing packets (i.e. the issue is on the motherboard itself)

image

The plot above shows one diff() of the data, zoomed in to see many periods. Clearly, there are no missed packets in the data....

If we zoom out to all 98k samples, we should be able to see if there is any fuzzy outline to indicate missed packets:

image

And the answer is.... NO. There are a few little spurs, but basically, perfect data is being sent from the motherboard over the UART wire. Therefore, the issue is NOT on the motherboard itself.
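To illustrate why the first difference is a good detector here, a small synthetic sketch: for a sampled sine wave, no single step can exceed a known bound, and a missing sample produces a step of up to about twice that bound. The 100 Hz signal frequency is an assumption (the function generator setting is not stated in this issue), and the helper names are hypothetical.

```python
import math

FS = 10_000.0   # sample rate (10 kHz, from the issue)
F_SIG = 100.0   # ASSUMPTION: function generator frequency, not stated in the issue
AMP = 8.0       # +/-8 V signal, from the issue


def first_difference(x):
    """Same idea as pandas Series.diff(), without the leading NaN."""
    return [b - a for a, b in zip(x, x[1:])]


def drop_samples(x, dropped_indices):
    """Model missed packets by removing the samples at the given indices."""
    dropped = set(dropped_indices)
    return [v for i, v in enumerate(x) if i not in dropped]


clean = [AMP * math.sin(2 * math.pi * F_SIG * n / FS) for n in range(1000)]
lossy = drop_samples(clean, dropped_indices=[100, 350, 700])

# Largest step a pure sine of this amplitude and frequency can take per sample.
max_step = 2 * AMP * math.sin(math.pi * F_SIG / FS)

clean_peak = max(abs(d) for d in first_difference(clean))
lossy_peak = max(abs(d) for d in first_difference(lossy))
```

With clean data the diff stays inside `max_step`, which gives the plot a sharp outline; any missed sample near a zero crossing pokes well outside it, which is the "fuzzy outline" being looked for.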

npetersen2 commented 3 years ago

The above comment proves that the motherboard is sending out the right data! However, it does not prove the right data is appearing at the AMDC board.... Maybe, in hardware, the packets are getting corrupted somehow? The UART is at 25Mbaud! And it has to go through isolators and various ICs. I will check if we are getting any corrupt data packets in the UART rx on the AMDC FPGA.

The AMDC FPGA IP for the motherboard keeps track of every single corrupt byte or timeout byte from the UART rx module. It keeps a counter for each. After running the AMDC and SensorCard for the last ~20 mins, here are the counter values:

image

V means valid, and you can see it keeps changing, which is good: it means new valid data is constantly coming in! C = corrupt and T = timeout; these are both all 0s. This is good: it means all the data is good!

If I unplug the motherboard from the AMDC:

image

Timeout and corrupt counters change, meaning they are working!

Therefore, my conclusion is that the UART interface between the boards is working fine. All the valid packets that we saw on the motherboard hardware are appearing in the FPGA and, from there, in the registers....

Hmmm.

npetersen2 commented 3 years ago

Interesting. Solved it!

The issue was how I was requesting new data from the motherboard. I requested new data from the motherboard in the scheduler right before it started running all the new tasks for a given timeslice.

The issue was that, sometimes, the data wouldn't be at the AMDC yet by the time the task that wanted the data ran. The task therefore didn't have new data, which resulted in double samples (i.e. it just used the last valid sample).

Requesting new data from the motherboard right AFTER all the tasks run for a timeslice gives the data much more time to arrive, which solves the issue...
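The race can be reproduced with a toy timing model. This is only a sketch of the reasoning, not the AMDC scheduler (which is C): the task read time, task duration, and latency values are all made-up assumptions, and `simulate` and `count_repeats` are hypothetical names.

```python
def simulate(request_after_tasks, num_slices=1000):
    """Return the sample id that the task reads in each timeslice.

    Sample id k is the data requested during timeslice k. The register
    holds the most recent sample that has fully arrived.
    """
    ts = 100.0          # timeslice length in us (10 kHz, from the issue)
    task_read = 5.0     # ASSUMPTION: task reads the register 5 us into the slice
    tasks_done = 40.0   # ASSUMPTION: all tasks have finished 40 us into the slice
    reads = []
    pending = []        # (arrival_time_us, sample_id) for requests in flight
    register = None     # None until the first sample arrives
    for k in range(num_slices):
        start = k * ts
        # ASSUMPTION: request-to-arrival latency varies from 3.5 us to 7.5 us.
        latency = 3.5 + (k % 5)
        if not request_after_tasks:
            pending.append((start + latency, k))
        # Latch everything that has arrived by the time the task reads.
        while pending and pending[0][0] <= start + task_read:
            register = pending.pop(0)[1]
        reads.append(register)
        if request_after_tasks:
            pending.append((start + tasks_done + latency, k))
    return reads


def count_repeats(reads):
    """Count timeslices in which the task saw the same sample as last time."""
    return sum(1 for a, b in zip(reads, reads[1:]) if a == b)


before = simulate(request_after_tasks=False)
after = simulate(request_after_tasks=True)
```

In this model, requesting before the tasks yields repeated samples whenever the latency exceeds the task's read time, while requesting after the tasks never repeats a sample but always delivers the one requested in the previous timeslice.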

This is a plot of 1 second of data (with one .diff()) from the motherboard (previously, this would have shown the issue):

image

Now, the data timing is for sure one Ts late.... We can maybe fix this by requesting data at the beginning of the controller, and then waiting until it comes in... But, this approach of being 1 Ts late is okay for now.

npetersen2 commented 3 years ago

Just to emphasize, I recaptured that same plot above using the WRONG auto request data scheduler code (i.e., it requests right before it runs the tasks).

image

Yucky.

npetersen2 commented 3 years ago

And lastly, I switched the code back to the correct implementation and sampled 10 seconds of data from all 8 motherboard channels at 10kHz, resulting in 100k samples per channel.

Raw samples:

image

Raw samples + .diff() to show any errors:

image

FFT of raw data with 3 .diff() for all channels to look at noise floor:

image

The noise floor with lines at sine wave harmonics (data with 3 .diff()):

image

Lastly, when I turn off the function gen sine wave, here is the noise floor of 0V input (no .diff() applied, straight raw data):

This is an FFT of voltages, so I believe the magnitude of ~10e-4 means much less than 1mV of noise ripple (?)

image

npetersen2 commented 3 years ago

Closing this issue. Resolved.

elsevers commented 3 years ago

@npetersen2 Glad to hear you resolved this!

[If I understand this correctly]: should we have some kind of status register in the FPGA fabric to help handle this? I am thinking of a model I have seen with UARTs, where a status flag is asserted when new data is available and this flag is cleared every time the data is read. This could be implemented alongside a FIFO. I realize that you would still have to do your sampling at the end of your task (because just knowing that you don't have new data isn't enough to make new data arrive), but from an error detection perspective, it may be nice to know if you are working with fresh data or not.

npetersen2 commented 3 years ago

@elsevers Yes, I agree. I actually already have a register bit in the firmware which indicates if the data is valid or not. Adding another bit like you describe above would be easy.

For high performance firmware, the idea is to use the forthcoming sys/controller.c type which runs the user controller code in a PWM-based ISR context. At the beginning of the controller code, the user should request new data from the motherboard. Then, they should busy poll the status register to wait until the new data arrives. Only then should they proceed and run their controller. This will ensure they have the latest data and that there is no 1 Ts delay in the sampling.
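The request-then-poll flow can be modeled roughly as follows. The real firmware is C and its register interface is not shown in this issue, so everything here is hypothetical: `FakeMotherboard` stands in for the FPGA registers, the flag is cleared on read as suggested earlier in the thread, and the `max_polls` bound is an added assumption so that a missing motherboard cannot hang the controller.

```python
class FakeMotherboard:
    """Stand-in for the FPGA registers; all names here are hypothetical."""

    def __init__(self, polls_until_ready):
        self._polls_until_ready = polls_until_ready
        self._countdown = None   # None means no transfer in flight
        self._data = 0
        self._new_data = False

    def request_new_data(self):
        # Start a transfer; the data lands after a few status polls.
        self._countdown = self._polls_until_ready

    def new_data_available(self):
        # Each poll advances the fake transfer by one step.
        if self._countdown is not None:
            if self._countdown == 0:
                self._data += 1
                self._new_data = True
                self._countdown = None
            else:
                self._countdown -= 1
        return self._new_data

    def read_data(self):
        # Reading clears the "new data" flag.
        self._new_data = False
        return self._data


def get_fresh_sample(mb, max_polls=1000):
    """Request data, busy poll until it arrives, then read it.

    Returns (sample, True) if fresh data arrived, or (last_sample, False)
    if it did not arrive within max_polls polls.
    """
    mb.request_new_data()
    for _ in range(max_polls):
        if mb.new_data_available():
            return mb.read_data(), True
    return mb.read_data(), False


mb = FakeMotherboard(polls_until_ready=3)
sample, fresh = get_fresh_sample(mb)
```

Returning the freshness flag alongside the sample lets the controller decide what to do with stale data instead of silently reusing the last value.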

For low performance firmware, the user can just turn on the "auto request data" flag and the scheduler will request data after it runs all tasks. This ensures the data is always "fresh," but will be 1 Ts late. However, the user code does not have to worry about asking for new data!

npetersen2 commented 3 years ago

For future reference, this Jupyter notebook can be used to parse data from the AMDS that was captured by a logic analyzer on the UART signal lines. This is useful for checking that the samples being sent are valid.

This is a valid Jupyter notebook! Rename it with the right extension (.ipynb) to look at it. It is .txt so GitHub would let me upload it... AMDS_SampleParsing.txt