cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Two LM active test case #105

Open gladky opened 7 years ago

gladky commented 7 years ago

Two LMs are satisfied here:

http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017%2009%2001%2013:56:15

Current output

RU waiting for other FED

RU ru-c2e14-16-01.cms is stuck waiting for FED [1106 and 5 more]. Problem FED(s) belong(s) to partition HBHEB in HCAL subsystem. This causes backpressure at FED 1386 in other FED-builder EMTFOMTF of subsystem TRG. Note that there is nothing wrong with backpressured FED 1386.

RUs failed

1 RUs (ru-c2e12-40-01.cms) are in failed state for an unidentified reason. The most often occurring ((1 times) error message is: Caught exception: exception::DataCorruption 'Received a corrupted event 533775 from FED 1111 (HCAL): FED header "eventid" 681282 does not match the eventNumber found in FEROL header, and FED header "sourceId" 226 does not match the FED 1111 (HCAL) found in FEROL header, and inconsistent event size: FED trailer claims 1728 Bytes, while sum of FEROL headers yield 1736. In addition, the FED trailer indicates that the FED id is not expected by the FEROL (FED trailer F bit is set), and wrong slink C

Expected output

What is the expected output?

mommsen commented 7 years ago

The RU failure is the primary cause; it leads to back pressure from ru-c2e14-16-01 to FED 1386. However, the 'RU waiting for other FED' message is misleading: this RU does not read out FED 1106. That FED is read out by the failed RU, ru-c2e12-40-01.

gladky commented 7 years ago

@mommsen Please let me know what you suggest to improve this message. Here is a test case where no other problem is interfering:

http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017-06-21-19:48:16

And here is the corresponding DAQExpert problem description:

RU ru-c2e14-16-01.cms is stuck waiting for FED [1280 and 107 more]. Problem FED(s) belong(s) to partition [BPIXP and 3 more] in PIXEL subsystem. This causes backpressure at FED 1386 in other FED-builder EMTFOMTF of subsystem TRG. Note that there is nothing wrong with backpressured FED 1386.

I understand that the indicated RU (c2e14-16-01) does not read out the problematic FEDs (1280, ..). The intention was to indicate that we see backpressure at one of the FEDs (1386) in this RU because of a problem with FEDs in another RU.

gladky commented 7 years ago

On the question of which LM we display: currently RuFailed has lower usefulness than all backpressure subcases, including RuWaitingForOthers. This means that for this particular test case we will display (and have displayed) the wrong conclusion. There are two solutions:

  1. Raise the RuFailed usefulness. Is this always valid? Is RuFailed more useful than RuWaitingForOthers in all cases?
  2. Modify the conditions of RuWaitingForOthers so that it ignores the FEDs of any RU that is in failed state.

Please let me know your opinion on this.
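To illustrate the first option, here is a minimal sketch of usefulness-based selection. It is not the DAQExpert code: the `Condition` class, the `dominating` helper, and the numeric usefulness values are all assumptions made up for the example. Only the LM names and their relative ordering come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical stand-in for one satisfied LM."""
    name: str
    usefulness: int  # assumed numeric rank; the highest value is displayed


def dominating(active):
    """Return the active condition with the highest usefulness."""
    return max(active, key=lambda c: c.usefulness)


# Illustrative values only: RuFailed currently ranks below RuWaitingForOthers.
current = [Condition("RuFailed", 9), Condition("RuWaitingForOthers", 10)]

# Option 1: raise RuFailed above RuWaitingForOthers.
raised = [Condition("RuFailed", 11), Condition("RuWaitingForOthers", 10)]

print(dominating(current).name)
print(dominating(raised).name)
```

With the current ordering the secondary condition wins the selection; after raising RuFailed, the primary cause is the one displayed. Both conditions are still generated in either case.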

mommsen commented 7 years ago

I think RuFailed will always lead to RuWaitingForOthers. Thus, raising the RuFailed usefulness is the right way to go.

gladky commented 7 years ago

If we know this, I would opt to avoid generating two conditions and use this knowledge to modify RuWaitingForOthers so that it does not look for problematic FEDs in RUs that are in failed state. I've checked this solution against all related test cases and they all pass.
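The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual change: the `Fed` and `Ru` classes, the `stuck` flag, and both helper functions are hypothetical, and the hostname `ru-other.cms` is invented for the second scenario. The FED ids and the other two hostnames are taken from the test cases above.

```python
from dataclasses import dataclass, field


@dataclass
class Fed:
    fed_id: int
    stuck: bool = False  # hypothetical flag: this FED is holding up the event


@dataclass
class Ru:
    hostname: str
    failed: bool = False
    feds: list = field(default_factory=list)


def problem_feds(rus):
    """Collect ids of stuck FEDs, skipping every RU that is in failed state."""
    return [
        fed.fed_id
        for ru in rus
        if not ru.failed
        for fed in ru.feds
        if fed.stuck
    ]


def ru_waiting_for_others(rus):
    """Satisfied only if a stuck FED exists outside of failed RUs."""
    return bool(problem_feds(rus))


# Scenario 1: the stuck FED belongs to the failed RU, so it is ignored.
with_failed_ru = [
    Ru("ru-c2e12-40-01.cms", failed=True, feds=[Fed(1106, stuck=True)]),
    Ru("ru-c2e14-16-01.cms", feds=[Fed(1386)]),
]

# Scenario 2: no RU has failed; the stuck FED sits in another healthy RU.
without_failed_ru = [
    Ru("ru-other.cms", feds=[Fed(1280, stuck=True)]),  # hypothetical hostname
    Ru("ru-c2e14-16-01.cms", feds=[Fed(1386)]),
]

print(ru_waiting_for_others(with_failed_ru))
print(ru_waiting_for_others(without_failed_ru))
```

In the first scenario only RuFailed would remain, so no usefulness ordering is needed to pick the right conclusion. In the second scenario RuWaitingForOthers still fires as before.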