cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Wrong diagnostic in 'Bug in filterfarm' #115

Closed: mommsen closed this issue 6 years ago

mommsen commented 6 years ago

Hi,

I noticed that the LM 'Bug in filterfarm' has just been triggered wrongly: http://daq-expert.cms/DAQExpert/?start=2017-09-14T15:04:58+02:00&end=2017-09-14T15:08:58+02:00

The problem is that all HLT processes on the FUs have failed, which causes most BUs to go into the Failed state (plus a few Blocked). Thus the logic should be augmented to also account for BUs in the Failed state (besides Blocked and Cloud).

Is there actually an LM which would trigger when all FUs have crashed?

Cheers, Remi

andreh12 commented 6 years ago

I think we should first clarify what we mean by 'bug in filterfarm' (in particular as opposed to 'hlt problem').

The relevant code is around here: https://github.com/cmsdaq/DAQExpert/blob/306343dbea0dda860aeff0eb8040c9b726a25ea2/src/main/java/rcms/utilities/daqexpert/reasoning/logic/failures/backpressure/BackpressureAnalyzer.java#L276

The code counts the number of BUs in the Blocked or Cloud state, but not Failed. Should Failed be treated the same as Blocked in this part of the code?
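
If so, a minimal sketch of the change could look like this (a sketch only; BU, daq.getBus() and getStateName() are assumed names based on this thread, not necessarily the actual DAQAggregator API):

// Count BUs that are not taking events: treat Failed like Blocked/Cloud,
// so a filter farm that has crashed outright is classified the same way.
int busNotProcessing = 0;
for (BU bu : daq.getBus()) {
  String state = bu.getStateName();
  if ("Blocked".equalsIgnoreCase(state)
      || "Cloud".equalsIgnoreCase(state)
      || "Failed".equalsIgnoreCase(state)) {
    busNotProcessing++;
  }
}
// If every BU is Blocked, Cloud or Failed, no BU is processing events.
boolean noBuProcessing = (busNotProcessing == daq.getBus().size());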

(by the way, we had some discussion about the BU states in #32)

mommsen commented 6 years ago

In this case it is clearly an 'hlt problem', as all CMSSW processes on the FUs have crashed. I would treat 'Failed' here identically to Blocked or Cloud.

BTW: IIRC, we have seen cases where there were no requests on the EVM, and where the BUs were causing backpressure because the FUs were no longer processing events but had not failed. Maybe one could turn the logic around, e.g. in pseudocode:

if (rusWithManyRequests.size() == 0) {
  // assumes this code is only reached while a run is ongoing
  if (numberOfEnabledBUs == 0) {           // no BU in Enabled state
    if (numberOfBUsWithCrashedFUs > 0) {   // at least one BU reports crashed FUs
      return Subcase.HltProblem;
    } else {
      return Subcase.BugInFilterfarm;
    }
  }
}
andreh12 commented 6 years ago

I added a snapshot corresponding to the case mentioned in the original post by @mommsen (see the GitHub message above).

In this snapshot we see:

    "numFUsHLT" : 0,
    "numFUsCrashed" : 14552,
    "numFUsStale" : 0,
    "numFUsCloud" : 6972,

I guess the module which should have fired in this instance would be HLTProblem? @gladky and/or @mommsen, can you confirm this?
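
For illustration, a minimal sketch of such a check on these snapshot counters (a sketch only; getNumFUsHLT() and getNumFUsCrashed() are assumed accessor names for the "numFUsHLT" / "numFUsCrashed" fields shown above):

// If no FU runs an HLT process anymore but a nonzero number of FUs
// have crashed, this points to an HLT problem rather than a DAQ one.
if (daq.getNumFUsHLT() == 0 && daq.getNumFUsCrashed() > 0) {
  return Subcase.HltProblem;
}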

mommsen commented 6 years ago

Yep, this is definitely an HLT problem (:

andreh12 commented 6 years ago

By the way, #134 seems related (in #134 the crashes happen slowly over time, while here the crashes happened at essentially the same time).

andreh12 commented 6 years ago

Pull request #137 should fix this.

andreh12 commented 6 years ago

Since the associated pull request has been merged, I'm closing this issue.