apollo-lhc / cm_mcu

Microcontroller source code for the APOLLO blade for the CMS tracker HL-LHC upgrade.
MIT License

intermittent errors caused by readout of 12 channel, 25 Gbps FF #225

Closed: pwittich closed this 1 month ago

pwittich commented 1 month ago

Intermittent errors are reported by @rzouCERN and others when reading out the 12-channel FF via the CLI command

ff regr 16 3

For example, this reads out register 0x16 on FF device 3. The symptom is that, until you do an ff_reset, the devices report errors like

20321596 MONI2C ERR MonitorTaskI2C_new.c:139:F1_3  12 Tx : page fail ADDR_ACK_ERROR

I was able to reproduce the issue on the board with the Segger and do some detailed debugging.

The board in question has the following load-out of FF

% ff_dump_names
ff_dump_names: ID registers
02:     ECUOT12251000513
03:     ECUOR12251000513
04:     ECUOT12251000513
05:     ECUOR12251000513
14:     CERNBY12024123M 
15:     CRRNBY12024123M 
19:     B0425040011201  

(I’m perennially confused why some of the CERN-B parts say CRRN-B…)

Observations:

  1. The problem appears when I run the aforementioned command, but not for all devices.
  2. I can make the problem go away by doing ff_reset 1. This toggles the reset pin on all FF devices connected to F1.
  3. If I disable the readout of the FPGA's FF devices via the firefly user config stored in the EEPROM, the problem appears when I restart the monitoring by resetting the user config.
  4. All CLI code uses the following to read and write the FF registers: https://github.com/apollo-lhc/cm_mcu/blob/094b3e24790fcca25c40d96a2ccac0d1dcd1387b/projects/cm_mcu/commands/SensorControl.c#L19 This is not used by the monitoring tasks.
  5. The code above does not clear the I2C mux when it is done.
  6. In my test setup I can no longer reproduce the issue when I add mux clearing.
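
To make observations 5 and 6 concrete, here is a minimal sketch of the "select the mux channel, do the transaction, always clear the mux" pattern. It is not the cm_mcu code: the mux and the FF register file are mocked in plain C so the logic can be compiled and checked on a host, and every name in it (`mock_mux_ctrl`, `mock_ff_regs`, `mock_mux_write`, `mock_dev_read`, `ff_read_reg_and_clear_mux`) is hypothetical. It also assumes a mux with a one-bit-per-channel control register where writing 0 disconnects all channels, which may not match the part on the board.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of an I2C mux control register: each set bit connects
 * one downstream channel, and 0 disconnects all of them (assumption). */
static uint8_t mock_mux_ctrl = 0;

/* Hypothetical mock of the register file of one device per mux channel. */
static uint8_t mock_ff_regs[8][256];

/* Mock "write to the mux": just latch the channel mask. */
static int mock_mux_write(uint8_t mask)
{
    mock_mux_ctrl = mask;
    return 0;
}

/* Mock "read a device register": succeeds only if exactly one channel
 * is currently connected through the mux. */
static int mock_dev_read(uint8_t reg, uint8_t *val)
{
    for (int ch = 0; ch < 8; ++ch) {
        if (mock_mux_ctrl == (1u << ch)) {
            *val = mock_ff_regs[ch][reg];
            return 0;
        }
    }
    return -1;
}

/* Read one register behind the mux and leave the mux cleared afterwards,
 * whether or not the read itself succeeded. */
int ff_read_reg_and_clear_mux(uint8_t channel, uint8_t reg, uint8_t *val)
{
    assert(val != NULL);
    if (channel >= 8)
        return -1;

    int rc = mock_mux_write((uint8_t)(1u << channel));
    if (rc == 0)
        rc = mock_dev_read(reg, val);

    /* Clear the mux on every path past the select, so that other users
     * of the bus do not inherit a stale channel selection. */
    int rc_clear = mock_mux_write(0);
    return (rc != 0) ? rc : rc_clear;
}
```

The point of the sketch is only the last step: the clear happens unconditionally once a channel has been selected, so an error in the device read cannot leave a channel open. In the real firmware the equivalent change would go into the CLI read/write helper linked in observation 4.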