Check tracking data quality inside `testConnectivity.py`

lpetre-ulb commented 5 years ago

Brief summary of issue

Multiple VFATs show tracking data quality issues during data taking at QC8. One reason is that some VFATs start to send garbage data at some point. The CTP7 event builder is then unable to properly handle the VFAT events.

Investigations showed that the data packet header can be sent in the absence of an L1A and/or that L1As are ignored. Moreover, the tracking data packet CRC is wrong for these inconsistent events. The following two counters should help diagnose such issues:

GEM_AMC.OH_LINKS.OHx.VFATy.DAQ_EVENT_CNT
GEM_AMC.OH_LINKS.OHx.VFATy.DAQ_CRC_ERROR_CNT

Types of issue

[ ] Bug report (report an issue with the code)
[x] Feature request (request for change which adds functionality)

Expected Behavior

Any obvious issue with the tracking data coming from a VFAT should be caught at the latest during the QC7 step.

Current Behavior

No dedicated tool exists to check the data quality based on the VFAT DAQ counters and the TTC generator.

However, the SCurves taken during the QC7 step already make heavy use of the tracking data. Since the SCurves already use the content of the data and check that it makes sense, I wonder how useful this issue really is.

Maybe a simple check on GEM_AMC.OH_LINKS.OHx.VFATy.DAQ_CRC_ERROR_CNT once the SCurves are taken would be enough.
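A minimal sketch of what such a post-SCurve check could look like. It assumes a hypothetical `read_reg(name)` callable that returns the integer value of a register; this is not the actual `testConnectivity.py` interface, and the default of 24 VFATs per OH is an assumption to adjust as needed:

```python
CRC_ERR_REG = "GEM_AMC.OH_LINKS.OH{oh}.VFAT{vfat}.DAQ_CRC_ERROR_CNT"


def find_crc_errors(read_reg, oh_list, vfats_per_oh=24):
    """Return {(oh, vfat): count} for every VFAT with a non-zero CRC error count.

    read_reg is a hypothetical callable: register name -> integer value.
    vfats_per_oh is an assumed default, not taken from the tool itself.
    """
    bad = {}
    for oh in oh_list:
        for vfat in range(vfats_per_oh):
            count = read_reg(CRC_ERR_REG.format(oh=oh, vfat=vfat))
            if count != 0:
                bad[(oh, vfat)] = count
    return bad
```

An empty result would mean that no VFAT reported a CRC error since the counters were last reset; any entry points at a VFAT worth a closer look.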

Also, what do we expect to learn if the CTP7-parsed content of the tracking data is already good?

Context (for feature requests)

Allow seamless data taking during the QC8.

bdorney commented 5 years ago

So I agree that resetting these counters and then checking them after an SCurve is probably not such a bad idea. But yes, as you pointed out, if the SCurves come up good, these counters are likely to come up good as well.

So this does raise the question: why do we have these errors in the first place?

I think we can use the "bad actor" OH on QC8 (presently OH8) to try to understand this issue. I agree it may be too premature to start development before we know about the failure mode.

jsturdy commented 5 years ago

Investigations showed that the data packet header can be sent in the absence of an L1A and/or that L1As are ignored. Moreover, the tracking data packet CRC is wrong for these inconsistent events.

Can you point me to "investigations" or provide a bit more context of what you mean by this statement?

I agree with @bdorney that we need to understand what causes these cases (or what the cases actually even are) and where they ultimately come from (VFAT data formatter, VFAT transmission, GEB signal interference, other) before we can do much else. For now, the only thing that one can really do is send a high rate of randoms for some fixed time, or send the "usual" cosmic rate of randoms for a long period of time (or both), and then check the counters mentioned.

Probably also a deeper raw analysis of the problem (and surrounding) events and, if they are recurrent, trying to correlate them with the CTP7 logs for symptomatic behaviours.
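A rough sketch of that random-trigger test, under stated assumptions: `read_reg`, `reset_counters` and `send_l1as` are hypothetical callables standing in for whatever the real tools provide (`send_l1as` is assumed to return the number of L1As actually sent), the counters are assumed not to overflow during the test, and 24 VFATs per OH is an assumed default.

```python
EVT_REG = "GEM_AMC.OH_LINKS.OH{oh}.VFAT{vfat}.DAQ_EVENT_CNT"
CRC_REG = "GEM_AMC.OH_LINKS.OH{oh}.VFAT{vfat}.DAQ_CRC_ERROR_CNT"


def run_random_trigger_check(read_reg, reset_counters, send_l1as, oh, n_vfats=24):
    """Send random L1As and report the VFATs whose counters look inconsistent.

    All three callables are hypothetical placeholders:
      read_reg(name)   -> integer register value
      reset_counters() -> clears the DAQ counters
      send_l1as()      -> sends the randoms and returns how many L1As were sent
    """
    reset_counters()
    n_sent = send_l1as()

    report = {}
    for vfat in range(n_vfats):
        n_evt = read_reg(EVT_REG.format(oh=oh, vfat=vfat))
        n_crc = read_reg(CRC_REG.format(oh=oh, vfat=vfat))
        # Flag extra or missing events (header without L1A, ignored L1A)
        # as well as any CRC error.
        if n_evt != n_sent or n_crc != 0:
            report[vfat] = {"events": n_evt, "l1a_sent": n_sent, "crc_errors": n_crc}
    return report
```

Comparing the event counter with the number of L1As sent is meant to catch both failure modes described above (packets sent without an L1A, and L1As that are ignored), while the CRC counter catches corrupted packets.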

lpetre-ulb commented 5 years ago

Can you point me to "investigations" or provide a bit more context of what you mean by this statement?

I might have had the wording wrong. I was referring in particular to two emails sent by Evaldas and to a discussion during last Thursday's DAQ meeting:

"Re: Strange behaviour of the QC8 stand DAQ during runs 137 ->139" on 02/07/2019 at 15:13
"Re: 3 DAQ issues noticed in analysing run000150 in the QC8 cosmic stand" on 28/06/2019 at 03:38

Regarding the L1A issue, I understood that it comes from unrecognized tracking data packet headers, but I may be wrong.

cms-gem-daq-project / vfatqc-python-scripts