Open gladky opened 7 years ago
The RU failure is the primary cause; it leads to backpressure from ru-c2e14-16-01 to FED 1386. However, the 'RU waiting for other FED' message is wrong: this RU does not read out FED 1106. That FED is read out by the failed RU on ru-c2e12-40-01.
@mommsen Please let me know what you suggest to improve this message. Here is the test case where no other problem is interfering.
http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017-06-21-19:48:16
And here is the corresponding DAQExpert problem description:
RU ru-c2e14-16-01.cms is stuck waiting for FED [1280 and 107 more]. Problem FED(s) belong(s) to partition [BPIXP and 3 more] in PIXEL subsystem. This causes backpressure at FED 1386 in other FED-builder EMTFOMTF of subsystem TRG. Note that there is nothing wrong with backpressured FED 1386.
I understand that the indicated RU (c2e14-16-01) does not read the problematic FEDs (1280, ..). The intention was to indicate that we see backpressure at one of the FEDs (1386) in this RU because of a problem with FEDs in another RU.
As for which LM we display: currently RuFailed has lower usefulness than all backpressure subcases, including RuWaitingForOthers. This means that for this particular test case we will display (and have displayed) the wrong conclusion. There are two solutions:
Please let me know what your opinion is on this.
I think RuFailed will always lead to RuWaitingForOthers. Thus, raising the RuFailed usefulness is the right way to go.
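To illustrate why raising the usefulness changes what is displayed, here is a minimal sketch, assuming the displayed condition is simply the satisfied LM with the highest usefulness. The `Condition` class, the `dominating` helper and the numeric values are hypothetical and not taken from the DAQExpert code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    usefulness: int  # hypothetical scale: higher wins


def dominating(conditions):
    """Return the satisfied condition with the highest usefulness."""
    return max(conditions, key=lambda c: c.usefulness)


# Hypothetical values mirroring the current ordering: RuFailed ranks lower.
current = [Condition("RuFailed", 1), Condition("RuWaitingForOthers", 5)]

# Same two conditions after raising the RuFailed usefulness.
raised = [Condition("RuFailed", 10), Condition("RuWaitingForOthers", 5)]

print(dominating(current).name)
print(dominating(raised).name)
```

With the first ordering the backpressure subcase is selected; once RuFailed ranks higher it becomes the displayed conclusion.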
If we know this, I would opt to avoid generating two conditions and use this knowledge to modify RuWaitingForOthers so that it does not look for the problematic FEDs in RUs that are in failed state. I've checked this solution against all related test cases and they all pass.
Two LMs are satisfied here:
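A rough sketch of that filtering idea, assuming the LM collects candidate problem FEDs per RU. The `RU` class, `candidate_problem_feds` and the `feds_without_data` input are hypothetical names for illustration only; the real DAQExpert model and the way it detects stalled FEDs may differ.

```python
from dataclasses import dataclass, field


@dataclass
class RU:
    hostname: str
    failed: bool
    fed_ids: list = field(default_factory=list)


def candidate_problem_feds(rus, feds_without_data):
    """Collect FEDs with missing data, skipping RUs that are in failed state."""
    result = []
    for ru in rus:
        if ru.failed:
            # A failed RU is reported by RuFailed; do not blame its FEDs here.
            continue
        result.extend(f for f in ru.fed_ids if f in feds_without_data)
    return sorted(result)


rus = [
    RU("ru-c2e14-16-01.cms", failed=False, fed_ids=[1386]),
    RU("ru-c2e12-40-01.cms", failed=True, fed_ids=[1106]),
]

# Assumption for the example: FED 1106 is the one other RUs are waiting for.
print(candidate_problem_feds(rus, feds_without_data={1106}))
```

In this example the only candidate FED sits behind the failed RU, so the list is empty and RuWaitingForOthers would have nothing to report, leaving RuFailed as the only condition.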
http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2017%2009%2001%2013:56:15
Current output
RU waiting for other FED
RU ru-c2e14-16-01.cms is stuck waiting for FED [1106 and 5 more]. Problem FED(s) belong(s) to partition HBHEB in HCAL subsystem. This causes backpressure at FED 1386 in other FED-builder EMTFOMTF of subsystem TRG. Note that there is nothing wrong with backpressured FED 1386.
RUs failed
1 RUs (ru-c2e12-40-01.cms) are in failed state for an unidentified reason. The most often occurring ((1 times) error message is: Caught exception: exception::DataCorruption 'Received a corrupted event 533775 from FED 1111 (HCAL): FED header "eventid" 681282 does not match the eventNumber found in FEROL header, and FED header "sourceId" 226 does not match the FED 1111 (HCAL) found in FEROL header, and inconsistent event size: FED trailer claims 1728 Bytes, while sum of FEROL headers yield 1736. In addition, the FED trailer indicates that the FED id is not expected by the FEROL (FED trailer F bit is set), and wrong slink C
Expected output
What is the expected output?