Closed by andreh12 6 years ago
For the future, if we want the DAQExpert to investigate such cases in more detail: I see that I can reproduce the number of 'HLT alerts' shown in F3mon for this run by doing:
curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/_count/?q=category:StdException+AND+run:312816'
(maybe @smorovic can confirm whether this query is ok?)
However, with the current separation between DAQAggregator and DAQExpert, we would have to run the above query independently of whether there is an error condition or not (i.e. roughly every 3 seconds; alternatively, we could run it e.g. every 30 seconds and keep the value of the last retrieval in subsequent DAQExpert snapshots).
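The "query every 30 seconds and reuse the last value" idea can be sketched as follows. This is only an illustrative sketch in Python (the actual DAQExpert is Java), and the helper name `fetch_hlt_alert_count` is hypothetical, standing in for whatever code performs the Elasticsearch count query:

```python
import time

class HltAlertCountCache:
    """Illustrative cache: refresh the HLT alert count at most every
    max_age_seconds and reuse the last retrieved value in between."""

    def __init__(self, fetch, max_age_seconds=30):
        self._fetch = fetch              # callable performing the ES query
        self._max_age = max_age_seconds
        self._last_value = None
        self._last_time = None

    def get(self, run):
        # Refresh only if we have no value yet or the cached one is stale;
        # otherwise, subsequent snapshots reuse the last retrieval.
        now = time.monotonic()
        if self._last_time is None or now - self._last_time >= self._max_age:
            self._last_value = self._fetch(run)
            self._last_time = now
        return self._last_value
```

With a 30-second maximum age, snapshots produced every ~3 seconds would hit Elasticsearch only about once per ten snapshots.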
Hi Andre, thanks for creating this issue. Was the reason the expert didn't identify this case a lack of data in the snapshot? Please note that we are already querying f3mon for each snapshot to get the HLT event rate, bandwidth, output disk and ramdisk usage.
Hello Maciej,
I don't know yet, but I will investigate in the coming days with the given snapshot (hence the assignment of this ticket to me).
Thanks for the information about querying f3mon; I'll also check how this is implemented and whether it can be extended to get the number of HLT error messages in the current run.
I created a test for this case here: https://github.com/cmsdaq/DAQExpert/compare/dev...andreh12:feature/test-issue-170?expand=1 (just for reference, not to be merged).
The test does not detect the problem because the field DAQ.hltInfo is null in the DAQExpert. This happens because the version of the DAQAggregator running for cDAQ is 1.17.6, which does not have this field. On deserialization of the snapshot in the DAQExpert, the field is simply deserialized as null.
In principle, the information about the crashing CMSSW processes is in the current snapshots and the code could be made backward compatible, but a better solution is to deploy a more recent version of the DAQAggregator, which fills DAQ.hltInfo.
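The backward-compatibility problem can be illustrated with a small sketch (Python rather than the Java actually used by DAQExpert; the field and helper names below are illustrative analogies, not the real code): a snapshot written by an older DAQAggregator simply lacks the field, so deserialization leaves it unset and consumers must guard against null.

```python
import json

# Sketch: a snapshot from an older DAQAggregator has no "hltInfo" key,
# so reading it back yields None and downstream logic must handle that.
old_snapshot = json.loads('{"daq": {"runNumber": 312816}}')

hlt_info = old_snapshot["daq"].get("hltInfo")  # None for old snapshots

def crashed_process_count(hlt_info):
    # Backward-compatible accessor: treat a missing hltInfo as "no data"
    # instead of raising, mirroring the null check a Java consumer needs.
    if hlt_info is None:
        return None
    return hlt_info.get("crashes", 0)
```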
@andreh12,
query looks good. With different syntax (one I tend to use) I get the same result: curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/_search' -d'{"query":{"bool":{"must":[{"term":{"category":"StdException"}},{"term":{"run":312816}}]}},"size":0}'
{"took":2,"timed_out":false,"_shards":{"total":2,"successful":2,"failed":0},"hits":{"total":4725,"max_score":0.0,"hits":[]}}
I'm not sure if StdException will catch every possible problem, but you can also search for only the cmsswlog type to avoid getting hltd errors. Only fatal-level messages are written to this index from CMSSW, either exceptions or crash-related output. curl -XGET 'http://es-cdaq.cms:9200/hltdlogs_cdaq/cmsswlog/_search'
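If this is ever wired into the expert, the JSON body of the bool query above could be built programmatically. The sketch below is a guess at an equivalent client-side construction (not code from hltd or DAQExpert); it produces the same filter as the curl example, with the run number and category as parameters:

```python
import json

def build_hlt_error_query(run, category="StdException", size=0):
    """Build an Elasticsearch query body matching the curl example:
    all documents in the given run with the given log category.
    size=0 returns only the hit count, not the documents."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"category": category}},
                    {"term": {"run": run}},
                ]
            }
        },
        "size": size,
    }

# Serialized body, as it would be sent with curl's -d option.
body = json.dumps(build_hlt_error_query(312816))
```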
Thanks @smorovic for the queries -- we'll keep them as a reference for when we decide to retrieve more details about massive HLT failures.
The original issue (DAQExpert not detecting massive HLT failures) should be fixed with the deployment of DAQAggregator version 1.17.11 today.
Yesterday evening there was a problem with Frontier and all BUs went into failed state, e.g. during run 312816. The DAQExpert identified this as unidentified problem, see e.g. http://daq-expert.cms/DAQExpert/?start=2018-03-26T17:24:01.146Z&end=2018-03-26T17:29:03.170Z . I will check why none of the HLT-related modules fired. An example snapshot is /daqexpert/snapshots/pro/cdaq/2018/3/26/17/1522085175236.json.gz .