`checkSbitRateWithCalPulseLocal` fails with new firmwares (3.8.x - 3.2.x)

lpetre-ulb commented 5 years ago

Brief summary of issue

While trying to debug the Sbits in the new firmwares with the run_scans.py sbitMapNRate/checkSbitMappingAndRate.py commands, the scans systematically failed.

According to the CTP7 log file the issue is located inside calibration_routines.checkSbitRateWithCalPulse.

Types of issue

[x] Bug report (report an issue with the code)
[ ] Feature request (request for change which adds functionality)

Expected Behavior

The RPC method should complete without errors.

Current Behavior

The RPC method fails with the following errors in the CTP7 log:

Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Unmasking channel 14 on vfat 0 of OH 0
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Enabling calpulse for channel 14 on vfat 0 of OH 0
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Reseting trigger counters on OH & CTP7
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Configuring TTC Generator to use OH 0 with pulse delay 40 and L1Ainterval 0
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Entering ttcGenConfLocal
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: System release major is 3, v3 electronics behavior
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: ttcGenConfLocal: V3 behavior
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: ttcGenConfLocal: call ttcGenToggleLocal
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: System release major is 3, v3 electronics behavior
Jul 11 09:45:32 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Starting TTC Generator
Jul 11 09:45:33 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Reading trigger counters
Jul 11 09:45:33 eagle63 local0.err rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: read memsvc error: Bus error accessing 0x650080c8
Jul 11 09:45:33 eagle63 local0.err rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: read memsvc error: Bus error accessing 0x6500805c
Jul 11 09:45:33 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Stopping TTC Generator
Jul 11 09:45:33 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Disabling calpulse for channel 14 on vfat 0 of OH 0
Jul 11 09:45:33 eagle63 local0.err rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Reading reg 65400038 failed 1 times.
Jul 11 09:45:33 eagle63 local0.info rpcsvc[16714]: calibration_routines.checkSbitRateWithCalPulse: Masking channel 14 on vfat 0 of OH 0

The registers 0x650080c8 and 0x6500805c correspond respectively to GEM_AMC.OH.OH0.FPGA.TRIG.CNT.CLUSTER_COUNT and GEM_AMC.OH.OH0.FPGA.TRIG.CNT.
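To pull the failing addresses out of a longer CTP7 log, a short script along these lines may help. It is only a sketch: the regular expression is derived from the two error lines in the excerpt above, and `failed_addresses` is a hypothetical helper name, not part of any GEM DAQ package.

```python
import re

# Pattern derived from the "read memsvc error" lines in the log excerpt above.
BUS_ERROR = re.compile(r"read memsvc error: Bus error accessing (0x[0-9a-fA-F]+)")


def failed_addresses(log_text):
    """Return the unique addresses that raised a bus error, in order of appearance."""
    seen = []
    for match in BUS_ERROR.finditer(log_text):
        address = int(match.group(1), 16)
        if address not in seen:
            seen.append(address)
    return seen


if __name__ == "__main__":
    # Two lines copied from the log excerpt in this report.
    sample = (
        "Jul 11 09:45:33 eagle63 local0.err rpcsvc[16714]: "
        "calibration_routines.checkSbitRateWithCalPulse: "
        "read memsvc error: Bus error accessing 0x650080c8\n"
        "Jul 11 09:45:33 eagle63 local0.err rpcsvc[16714]: "
        "calibration_routines.checkSbitRateWithCalPulse: "
        "read memsvc error: Bus error accessing 0x6500805c\n"
    )
    print([hex(a) for a in failed_addresses(sample)])
```

The resulting addresses can then be looked up in the address table to recover the register names.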

Steps to Reproduce (for bugs)

Launch an sbitMapNRate scan, e.g. run_scans.py sbitMapNRate 1 4 0x1 -r 1e3 -n 10
The scan fails
While the scan is running, all OH FPGA register accesses fail (in gem_reg.py)
As soon as the scan is over, the accesses succeed again

Possible Solution (for bugs)

The sbitMapNRate scan is always first launched with a pulse rate of 0 Hz: https://github.com/cms-gem-daq-project/vfatqc-python-scripts/blob/dae6fb9a1d65f0d7081dc040832faf1c5f77123e/checkSbitMappingAndRate.py#L153-L166

In the ctp7_modules, the L1A interval is then set to 0: https://github.com/cms-gem-daq-project/ctp7_modules/blob/92eeadc1993dfd85a19cd0a187f76eefb5029e06/src/calibration_routines.cpp#L976-L983

And the counters are read while the TTC Generator is running: https://github.com/cms-gem-daq-project/ctp7_modules/blob/92eeadc1993dfd85a19cd0a187f76eefb5029e06/src/calibration_routines.cpp#L1047-L1061

With the new firmware releases and the new 6b8b OH FPGA communication protocol, the bandwidth is shared between TTC commands and slow control. Since TTC commands have higher priority, slow control communication is impossible if L1As are sent on every clock cycle.
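The starvation effect can be illustrated with a toy model. This is not the real 6b8b frame format: it assumes one transfer slot per clock cycle and that a TTC command takes the whole slot whenever one is pending. The function name and the notion of an "L1A period" (1 meaning an L1A on every cycle) are introduced here purely for illustration.

```python
def slow_control_slots(n_cycles, l1a_period):
    """Count the clock cycles left for slow control in a toy priority-shared link.

    l1a_period is the number of clock cycles between consecutive L1As
    (1 means an L1A on every cycle). Illustration only, not the 6b8b protocol.
    """
    if l1a_period < 1:
        raise ValueError("l1a_period must be at least 1 clock cycle")
    free = 0
    for cycle in range(n_cycles):
        ttc_busy = cycle % l1a_period == 0  # the TTC command wins this cycle
        if not ttc_busy:
            free += 1
    return free


if __name__ == "__main__":
    for period in (1, 2, 10, 1000):
        print(period, slow_control_slots(1000, period))
```

In this model, a period of 1 leaves no slot at all for slow control, which matches the bus errors seen while the generator is running, while even a modest spacing between L1As frees most of the link.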

I would suggest adding a lower limit on the L1A interval (to be defined) in ttcGenConfLocal so that slow control communication is always possible. At the same time, I would change the pulse rate of 0 in checkSbitMappingAndRate.py to 1.
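As a rough sketch of the proposed lower limit, written in Python for brevity although ttcGenConfLocal itself lives in the C++ ctp7_modules: the clock frequency, the rate-to-interval conversion and the value of `MIN_L1A_INTERVAL` are all assumptions made for illustration, and the actual limit is still to be defined.

```python
import warnings

LHC_CLOCK_HZ = 40079000   # assumed clock frequency for the rate conversion
MIN_L1A_INTERVAL = 100    # placeholder only; the real lower limit is to be defined


def l1a_interval_for_rate(pulse_rate_hz, min_interval=MIN_L1A_INTERVAL):
    """Convert a pulse rate into an L1A interval (clock cycles), never below min_interval."""
    if pulse_rate_hz < 0:
        raise ValueError("pulse rate must be non-negative")
    if pulse_rate_hz == 0:
        requested = 0  # behaviour described in this report: 0 Hz ends up as interval 0
    else:
        requested = int(LHC_CLOCK_HZ / pulse_rate_hz)
    if requested < min_interval:
        warnings.warn(
            "L1A interval %d is below the lower limit %d; using the limit instead"
            % (requested, min_interval)
        )
        return min_interval
    return requested
```

With these placeholder values, a 1 kHz request keeps its computed interval, while a 0 Hz or very high rate request is raised to the limit and a warning is issued.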

Your Environment

Version used: https://github.com/cms-gem-daq-project/ctp7_modules/commit/92eeadc1993dfd85a19cd0a187f76eefb5029e06

bdorney commented 5 years ago

Out of curiosity did you try on 3.7.X and 3.1.5.*?

mexanick commented 5 years ago

I faced the very same issue on GE2/1 with CTP7 3.8.2 and artix-specific OH FW

bdorney commented 5 years ago

> I would suggest adding a lower limit on the L1A interval (to be defined) in ttcGenConfLocal so that slow control communication is always possible. At the same time, I would change the pulse rate of 0 in checkSbitMappingAndRate.py to 1.

Seems like the wrong repo for this comment. I think this issue here should be how to ensure a "0 Hz" is possible; e.g. how to change the code to allow this to be reliable. Changing the python tool should have an associated vfatqc issue.

lpetre-ulb commented 5 years ago

> Out of curiosity did you try on 3.7.X and 3.1.5.*?

No, I didn't; I can try these releases tonight.

> I faced the very same issue on GE2/1 with CTP7 3.8.2 and artix-specific OH FW

Indeed, the issue is very likely caused by the new 6b8b protocol, which is common to GE1/1 and GE2/1.

> I would suggest adding a lower limit on the L1A interval (to be defined) in ttcGenConfLocal so that slow control communication is always possible. At the same time, I would change the pulse rate of 0 in checkSbitMappingAndRate.py to 1.

> Seems like the wrong repo for this comment. I think this issue here should be how to ensure a "0 Hz" is possible; e.g. how to change the code to allow this to be reliable.

Yes, we can see it like that. Ensuring a "0 Hz" rate would mean disabling the TTC generator or not starting it at all. However, I think that requesting an operation which overloads the FPGA e-link must still be forbidden or at least warned about.
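One way to express this policy is a small decision function: leave the generator off for 0 Hz, and refuse any request whose L1A interval would starve slow control. This is a hypothetical sketch; the function name, the returned strings and the choice to raise an error rather than warn are illustrative and do not reflect the ctp7_modules API.

```python
def ttc_generator_action(pulse_rate_hz, l1a_interval, min_interval):
    """Decide how to handle the TTC generator for one measurement point.

    Returns "skip" when no pulses are requested and "start" when the
    interval is safe; raises ValueError when the request would overload
    the link shared with slow control.
    """
    if pulse_rate_hz == 0:
        return "skip"  # a true 0 Hz rate: never start the generator
    if l1a_interval < min_interval:
        raise ValueError(
            "L1A interval %d is below the lower limit %d"
            % (l1a_interval, min_interval)
        )
    return "start"
```

Replacing the exception with a logged warning would give the softer "warned about" variant.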

> Changing the python tool should have an associated vfatqc issue.

Indeed, it was a last-second thought. But first I'm still trying to debug the Sbit mapping scan routine, since the plots produced by the analysis tool are empty even though the scan does not report any error.

bdorney commented 5 years ago

> However, I think that requesting an operation which overloads the FPGA e-link must still be forbidden or at least warned about.

Yes, absolutely. This was definitely an oversight from my side.

> Indeed, it was a last-second thought. But first I'm still trying to debug the Sbit mapping scan routine, since the plots produced by the analysis tool are empty even though the scan does not report any error.

Mmmhmm, do you think this is due to the analysis tool or to the data itself? If it is the analysis tool, we should open an issue in gem-plotting-tools and try to debug it there.

lpetre-ulb commented 5 years ago

> Mmmhmm, do you think this is due to the analysis tool or to the data itself? If it is the analysis tool, we should open an issue in gem-plotting-tools and try to debug it there.

I think it is due to the data itself, since the raw ROOT file is "empty" (no valid Sbit, all rates are 0 Hz, ...).

lpetre-ulb commented 5 years ago

> Out of curiosity did you try on 3.7.X and 3.1.5.*?

I forgot to share the information. I tried on 3.7.X and 3.1.5.* this weekend, and while the script reports no error, the plots and the ROOT file are still empty.
