cms-gem-daq-project / reg_utils


[reliability] try to reduce/eliminate spurious 0xdeaddead values from rwreg #29

Closed jsturdy closed 5 years ago

jsturdy commented 6 years ago

Brief summary of issue

Checking the status of a set of VFAT registers, e.g., rwc OHX*VFAT*VFAT*ContReg0, often (always?) results in one or more 0xdeaddead readback values, whereas vfat_info_uhal.py only reports these when the chip is actually not present.

Types of issue

Expected Behavior

0xdeaddead should only be reported for truly disconnected hardware; single failed transactions should be eliminated insofar as that is possible.

Current Behavior

eagle33 > rwc OH7*VFAT*VFAT*ContReg0
0x655c0000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT0.ContReg0             0x03000000
0x655c0400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT1.ContReg0             0x03010000
0x655c0800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT2.ContReg0             0x03020037
0x655c0c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT3.ContReg0             0x03030037
0x655c1000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT4.ContReg0             0x03040037
0x655c1400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT5.ContReg0             0x03050037
0x655c1800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT6.ContReg0             0x03060037
0x655c1c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT7.ContReg0             0x03070037
0x655c2000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT8.ContReg0             0x03080037
0x655c2400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT9.ContReg0             0x03090037
0x655c2800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT10.ContReg0            0x030a0037
0x655c2c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT11.ContReg0            0x030b0037
0x655c3000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT12.ContReg0            0x030c0037
0x655c3400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT13.ContReg0            0x030d0037
0x655c3800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT14.ContReg0            0x030e0037
0x655c3c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT15.ContReg0            0x030f0037
0x655c4000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT16.ContReg0            0x03100037
0x655c4400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT17.ContReg0            0x03110037
0x655c4800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT18.ContReg0            0x03120037
0x655c4c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT19.ContReg0            0x03130037
0x655c5000 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT20.ContReg0            0xdeaddead
0x655c5400 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT21.ContReg0            0x03150037
0x655c5800 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT22.ContReg0            0x03160037
0x655c5c00 rw   top.GEM_AMC.OH.OH7.GEB.VFATS.VFAT23.ContReg0            0x03170037
[ ]% for (( oh=2; oh<=9; oh++ )); do vfat_info_uhal.py -s3 -g${oh}; done |fgrep ContReg0
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37

[ ]% for (( oh=2; oh<=9; oh++ )); do vfat_info_uhal.py -s3 -g${oh}; done |fgrep ContReg0
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37

[ ]% for (( oh=2; oh<=9; oh++ )); do vfat_info_uhal.py -s3 -g${oh}; done |fgrep ContReg0
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x80   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x00   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37
   ContReg0::  0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37   0x37

In the vfat_info_uhal.py output, the non-0x37 values are real aberrations, currently under investigation, while the 0xdeaddead readback from gem_reg.py is a spurious failed transaction of some kind.
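To make the comparison between the two outputs easier, the sketch below splits a raw rwc readback word into fields. The layout is an assumption inferred only from the listing above (top byte constant at 0x03, next byte tracking the VFAT index, low byte matching the value printed by vfat_info_uhal.py); it has not been checked against the firmware address table, and `decode_contreg0` is a hypothetical helper, not part of reg_utils.

```python
ERROR_SENTINEL = 0xDEADDEAD  # readback reported for a failed transaction


def decode_contreg0(word):
    """Split a raw 32-bit readback into fields.

    ASSUMED layout, inferred from the rwc listing only:
      bits 31-24: status/flags byte (0x03 in every good read above)
      bits 23-16: VFAT index
      bits  7-0 : register value (what vfat_info_uhal.py prints)
    Returns None for the 0xdeaddead sentinel.
    """
    if word == ERROR_SENTINEL:
        return None
    return {
        "status": (word >> 24) & 0xFF,
        "vfat": (word >> 16) & 0xFF,
        "value": word & 0xFF,
    }


# Example with two words taken from the OH7 listing above.
for raw in (0x030A0037, 0xDEADDEAD):
    print(hex(raw), decode_contreg0(raw))
```

With this reading, the 0x00 values for VFAT0 and VFAT1 on OH7 agree between rwc and vfat_info_uhal.py, while VFAT20 differs only because of the failed transaction.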

Context (for feature requests)

eagle33 is using the memhub transaction serializer, so collisions should be minimal. The primary failures most likely come from the timeout limitations of the CTP7 Linux image on remote registers (fixed in a future Linux image, FW not ported to this image).

This issue is to track any reliability issues that are solely on this side of that divide, assuming that the linux core image issues and overlapping transactions will be addressed in a more appropriate place.

mexanick commented 6 years ago

@jsturdy so you want to introduce transaction retries? The best place to implement this is in librwreg, so I need the source code for librwreg_memhub. BTW, why is it not here? Is it in a different repository? If so, please provide a reference and an explanation of why it is not here.
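For illustration, a retry wrapper of the kind discussed here could look roughly like the following. This is only a sketch of the idea at the Python level, not the librwreg implementation: `read_with_retry`, `make_flaky_reader`, and the `read_fn(address)` callable are hypothetical names standing in for whatever read call is actually used, and the attempt count and delay are arbitrary.

```python
import time

ERROR_SENTINEL = 0xDEADDEAD  # readback reported for a failed transaction


def read_with_retry(read_fn, address, max_attempts=3, delay_s=0.0):
    """Call read_fn(address) until it returns something other than the sentinel.

    read_fn is any callable taking an address and returning a 32-bit value
    (a hypothetical stand-in for the real register read). If every attempt
    fails, the sentinel is returned so callers still see the error.
    """
    value = ERROR_SENTINEL
    for attempt in range(max_attempts):
        value = read_fn(address)
        if value != ERROR_SENTINEL:
            return value
        if delay_s > 0 and attempt + 1 < max_attempts:
            time.sleep(delay_s)  # give the bus a moment before retrying
    return value


def make_flaky_reader(failures, good_value):
    """Test double: fails `failures` times, then returns good_value."""
    state = {"calls": 0}

    def read_fn(address):
        state["calls"] += 1
        if state["calls"] <= failures:
            return ERROR_SENTINEL
        return good_value

    read_fn.state = state
    return read_fn


# Simulated register that fails twice before answering.
reader = make_flaky_reader(failures=2, good_value=0x03150037)
print(hex(read_with_retry(reader, 0x655C5400, max_attempts=5)))
```

One caveat with this approach: a register that legitimately holds 0xdeaddead would be retried needlessly and is indistinguishable from a failure, so a persistent sentinel after all attempts should still be treated as "hardware not present or not responding".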