Test cases: verification after merging recent features

gladky commented 7 years ago

I'm merging a few features into the integration branch (fed-hierarchy, new backpressure analysis) and I need your help verifying some test cases. Please confirm that the DAQExpert output is correct for the following cases:

Case 1

http://daq-expert-dev.cms/daq2view-react/index.html?setup=testbed&time=2017-06-09-17:21:56

'Fed stuck': TTCP [BPIXP and 4 more] of ["TRACKER","PIXEL"] subsystem is blocking trigger, it's in ["BUSY","WARNING"] TTS state, The problem is caused by FED [(1204 behind pseudo FED 11200) and 39 more] in (FED has no individual TTS state, BUSY @ its pseudo FED)

'Ru failed' 17 RUs ([ru-c2e15-27-01.cms and 16 more]) are in failed state for an unidentified reason. The most often occurring ((1 times) error message is: Caught exception: exception::TCP 'Received a connection from 10.180.226.90:10107 while not accepting new connections' raised at connectionAcceptedEvent(/usr/local/src/xdaq/baseline14/trunk/daq/evb/include/evb/readoutunit/FerolConnectionManager.h:191)

Are both correct?

Case 2

http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017-06-22-04:01:25

'FED stuck' : TTCP CSC+ of CSC subsystem is blocking trigger, it's in WARNING TTS state, The problem is caused by FED 838 in WARNING
'FEROL/FEROL40 FIFO stuck': FEROL of FED [1232 and 6 more] stopped sending fragments to its RU. This is likely a bug in the FEROL/FEROL40 firmware.

Is 'FEROL/FEROL40 FIFO stuck' correct?

Case 3

http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017-06-14-15:56:04

'Partition problem': Partition TIBTID in TRACKER subsystem is in OUT_OF_SYNC TTS state. It's blocking trigger.
'Corrupted data received' : Run blocked by corrupted data from FED [841,843] received by RU ru-c2e12-30-01.cms which is now in failed state. Problem FED belongs to partition CSC+ in CSC subsystem This causes backpressure at FED 1386 in partition MUTFUP of TRG
'RUs failed': 1 RUs (ru-c2e12-30-01.cms) are in failed state for an unidentified reason. The most often occurring ((1 times) error message is: Caught exception: exception::DataCorruption 'Received a corrupted event 1 from FED 843 (CSC): mismatch of FED id in FEROL header: expected FED 843 (CSC), but got 4095, and FED header "sourceId" 4095 does not match the FED 843 (CSC) found in FEROL header. In addition, the FED trailer indicates that the FED id is not expected by the FEROL (FED trailer F bit is set)' raised at reportErrors(/usr/local/src/xdaq/baseline14/trunk/daq/evb/src/common/readoutunit/FedFragment.cc:541)

Partition-problem and corrupted-data-received seem fine, but rus-failed seems redundant in this case.

mommsen commented 7 years ago

Case 1: the real reason is in the 2nd bullet. I guess the 1st one shows up because we do not see the backpressure from DAQ on the BPIX FEDs.

Case 2: this issue has been discussed in detail in the email thread "Fwd: ELOG : DAQ : Dump of FEROL40 with FED Id [1232 and 6 more] when blocking the run" on June 22/23. I think the conclusion is that we do not know if the FEROL40 was indeed stuck for a couple of seconds, or if there was a monitoring hiccup or anything else. I would keep the message as is for now and see if we can find another case.

Case 3: The message is indeed redundant. However, the RU error message gives the details about the corruption. Is it possible to add the error message to the 'Corrupted data received' text?

mommsen commented 7 years ago

BTW: if you want to make the RU/EVM/BU error messages more readable, one could strip everything outside the quotation marks, i.e. get rid of the 'Caught exception: exception::DataCorruption' and ' raised at ...' parts.
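
A minimal sketch of that stripping, written in Python only for illustration (it is not DAQExpert code, and the function name `extract_quoted_message` is hypothetical). It assumes the message is delimited by the first and the last single quote in the string, which holds for the examples in this thread, where the inner text uses only double quotes and the ' raised at ...' suffix contains no quotes.

```python
def extract_quoted_message(raw: str) -> str:
    """Return the text between the first and last single quote.

    Assumption: the readable part of the error is wrapped in single
    quotes, as in the RU messages quoted in this thread. If no such
    pair is found, the input is returned unchanged (minus surrounding
    whitespace) so that no information is lost.
    """
    start = raw.find("'")
    end = raw.rfind("'")
    if start == -1 or end <= start:
        return raw.strip()
    return raw[start + 1:end]
```

Using the first and last quote, rather than the first pair, keeps messages that contain an apostrophe intact. It would cut in the wrong place if the 'raised at' part ever contained a single quote.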

gladky commented 7 years ago

Case 1: So we can keep both and configure the usefulness parameter to deliver the best suggestion to the shifter. Right now Ru-failed has lower usefulness than all Known-failures, so in this case Fed-stuck will be the primary suggestion. Shall we switch the levels of usefulness? I opened a separate ticket to follow this: #85. If you have suggestions about which problems should be considered more important, please add them.
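
To make the question concrete, here is a rough sketch of how the usefulness ordering decides the primary suggestion. It is Python for illustration only, not the actual DAQExpert implementation: the `Condition` class, the `primary_suggestion` function and the numeric values are all hypothetical, and it assumes that a higher usefulness value wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    usefulness: int  # hypothetical scale: higher value = preferred suggestion


def primary_suggestion(active):
    """Pick the active condition with the highest usefulness."""
    return max(active, key=lambda c: c.usefulness)


# Hypothetical values mirroring the current ordering (Ru-failed below Fed-stuck).
current = [Condition("Fed-stuck", 10), Condition("Ru-failed", 9)]
# The same conditions with the two levels switched.
switched = [Condition("Fed-stuck", 9), Condition("Ru-failed", 10)]
```

With `current`, `primary_suggestion` returns Fed-stuck, which matches what the shifter sees today for Case 1. With `switched`, it returns Ru-failed.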

gladky commented 7 years ago

Case 1: Shall we switch the levels of usefulness? Is it correct to say that ru-failed will always be more important and more accurate than fed-stuck?
