cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

all BUs in failed state led to unidentified problem #170

Closed: andreh12 closed this issue 6 years ago

andreh12 commented 6 years ago

Yesterday evening there was a problem with Frontier and all BUs went into a failed state, e.g. during run 312816.

The DAQExpert classified this as an unidentified problem, see e.g. http://daq-expert.cms/DAQExpert/?start=2018-03-26T17:24:01.146Z&end=2018-03-26T17:29:03.170Z ; I will check why none of the HLT-related modules fired.

An example snapshot is /daqexpert/snapshots/pro/cdaq/2018/3/26/17/1522085175236.json.gz .

andreh12 commented 6 years ago

For the future, if we want the DAQExpert to investigate such cases in more detail: I see that I can reproduce the number of 'HLT alerts' shown in F3mon for this run by doing:

curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/_count/?q=category:StdException+AND+run:312816'

(maybe @smorovic can confirm whether this query is ok?)

However, with the current separation between DAQAggregator and DAQExpert we would have to run the above query regardless of whether there is an error condition, i.e. roughly every 3 seconds. Alternatively we could run it e.g. every 30 seconds and just keep the value of the last retrieval in subsequent DAQExpert snapshots.
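The polling scheme described above could be sketched roughly as follows. This is only a sketch: the endpoint and `StdException` query come from this thread, while the `CachedCounter` helper and its names are hypothetical.

```python
import json
import time
import urllib.request

# Count endpoint taken from the curl command in this thread
ES_COUNT_URL = "http://es-cdaq.cms:9200/hltdlogs_cdaq/_count"

def fetch_hlt_alert_count(run):
    """Query Elasticsearch for the number of StdException messages in a run."""
    url = "%s/?q=category:StdException+AND+run:%d" % (ES_COUNT_URL, run)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["count"]

class CachedCounter:
    """Refresh the count at most every `interval` seconds and otherwise
    return the last retrieved value, so the query does not have to run
    for every ~3 s snapshot."""

    def __init__(self, fetcher, interval=30.0):
        self.fetcher = fetcher
        self.interval = interval
        self.last_value = None
        self.last_time = None

    def get(self, run, now=None):
        now = time.time() if now is None else now
        if self.last_time is None or now - self.last_time >= self.interval:
            self.last_value = self.fetcher(run)
            self.last_time = now
        return self.last_value
```

With this, the DAQExpert loop would call `CachedCounter.get(run)` on every snapshot, but the Elasticsearch query itself would only fire once per interval.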

gladky commented 6 years ago

Hi Andre, thanks for creating this issue. Was the lack of data in the snapshot the reason why the expert didn't identify this case? Please note that we are already querying F3mon for each snapshot to get the HLT event rate, bandwidth, output disk and ramdisk usage.

andreh12 commented 6 years ago

Hello Maciej,

I don't know yet but I will investigate in the coming days with the given snapshot (hence the assignment of this ticket to me).

Thanks for the information about querying F3mon; I'll also check how this is implemented and whether it can be extended to get the number of HLT error messages in the current run.

andreh12 commented 6 years ago

I created a test for this case here: https://github.com/cmsdaq/DAQExpert/compare/dev...andreh12:feature/test-issue-170?expand=1 (just for reference, not to be merged).

The test does not detect the problem because the field DAQ.hltInfo is null in the DAQExpert. This happens because the version of the DAQAggregator running for cDAQ is 1.17.6, which does not have this field; on deserialization of the snapshot in the DAQExpert, the missing field is simply deserialized as null.

In principle the information about the crashing CMSSW processes is in the current snapshots and the code could be made backward compatible, but a better solution is to deploy a more recent version of the DAQAggregator which fills DAQ.hltInfo.
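The backward-compatible option mentioned above would essentially be a null guard around the missing field. A minimal sketch, using a Python dict to stand in for the deserialized snapshot (the key name follows DAQ.hltInfo; the helper itself is hypothetical):

```python
def get_hlt_info(snapshot):
    """Return the hltInfo block of a deserialized snapshot, or None when the
    snapshot was produced by an older DAQAggregator (e.g. 1.17.6) that lacks
    the field. Callers must then fall back to some other source, e.g. a
    direct F3mon/Elasticsearch query, instead of failing on a null field."""
    # dict-style access mirrors the Java deserialization behaviour:
    # a field absent from the JSON simply comes back as null/None
    return snapshot.get("hltInfo")
```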

smorovic commented 6 years ago

@andreh12,

The query looks good. With a different syntax (the one I tend to use) I get the same result:

curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/_search' -d'{"query":{"bool":{"must":[{"term":{"category":"StdException"}},{"term":{"run":312816}}]}},"size":0}'

{"took":2,"timed_out":false,"_shards":{"total":2,"successful":2,"failed":0},"hits":{"total":4725,"max_score":0.0,"hits":[]}}

I'm not sure if StdException will catch every possible problem, but you can also restrict the search to the cmsswlog type to avoid picking up hltd errors. Only fatal-level messages are written to this index from CMSSW, either exceptions or crash-related output:

curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/cmsswlog/_search'
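The bool query and the response quoted above can also be built and parsed programmatically. A sketch under the assumption that the DAQExpert side would handle the JSON itself (the helper names are mine, not from the codebase):

```python
import json

def build_count_query(category, run):
    """Build the _search body equivalent to the curl command in this thread:
    match both the log category and the run number, return no documents."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"category": category}},
                    {"term": {"run": run}},
                ]
            }
        },
        "size": 0,
    }

def total_hits(response):
    """Extract the total hit count from an Elasticsearch _search response."""
    return response["hits"]["total"]

# The response quoted earlier in this thread:
example = json.loads(
    '{"took":2,"timed_out":false,'
    '"_shards":{"total":2,"successful":2,"failed":0},'
    '"hits":{"total":4725,"max_score":0.0,"hits":[]}}'
)
```

Here `total_hits(example)` recovers the 4725 StdException messages for run 312816 without fetching any of the documents themselves, since `"size": 0` suppresses the hit list.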

andreh12 commented 6 years ago

Thanks @smorovic for the queries -- we'll keep them as a reference for when we decide to retrieve more details about massive HLT failures.

The original issue (DAQExpert not detecting massive HLT failures) should be fixed with the deployment of DAQAggregator version 1.17.11 today.