cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

High deadtime due to FUs crashing #134

Closed: gladky closed this issue 6 years ago

gladky commented 6 years ago

There was high deadtime due to FUs crashing. http://daq-expert-dev.cms/DAQExpert/?start=2017-10-01T17:59:13.039Z&end=2017-10-01T18:39:58.806Z

Around 20:20 we see that the number of crashed FUs is increasing. We could easily identify the problem by checking the percentage of crashed FUs against a threshold (Hannes suggested 1%).

http://daq-expert-dev.cms/daq2view-react/index_fff.html?setup=testbed&time=2017-10-01-20:19:37

The question is: could the problem be identified earlier than 20:20?

andreh12 commented 6 years ago

Grepping for the first occurrence of numFUsCrashed in the corresponding JSON snapshots directory, one can see that the number of crashed FUs steadily builds up in this case (first crashed FUs at 19:48:59 CEST, in 1506880139180.json.gz):

1506880130036.json.gz:     "numFUsCrashed" : 0,
1506880132242.json.gz:     "numFUsCrashed" : 0,
1506880134436.json.gz:     "numFUsCrashed" : 0,
1506880136757.json.gz:     "numFUsCrashed" : 0,
1506880139180.json.gz:     "numFUsCrashed" : 4,
1506880141276.json.gz:     "numFUsCrashed" : 8,
1506880143547.json.gz:     "numFUsCrashed" : 20,
1506880145644.json.gz:     "numFUsCrashed" : 40,
1506880148509.json.gz:     "numFUsCrashed" : 40,
1506880150851.json.gz:     "numFUsCrashed" : 44,
1506880153227.json.gz:     "numFUsCrashed" : 56,
1506880155429.json.gz:     "numFUsCrashed" : 64,
1506880158008.json.gz:     "numFUsCrashed" : 72,
1506880160383.json.gz:     "numFUsCrashed" : 92,
1506880162600.json.gz:     "numFUsCrashed" : 112,
1506880164808.json.gz:     "numFUsCrashed" : 124,
1506880167068.json.gz:     "numFUsCrashed" : 160,
1506880169843.json.gz:     "numFUsCrashed" : 212,
1506880172025.json.gz:     "numFUsCrashed" : 276,
1506880174245.json.gz:     "numFUsCrashed" : 344,
1506880176622.json.gz:     "numFUsCrashed" : 380,
1506880179002.json.gz:     "numFUsCrashed" : 452,
1506880181312.json.gz:     "numFUsCrashed" : 552,
1506880183568.json.gz:     "numFUsCrashed" : 680,
1506880185714.json.gz:     "numFUsCrashed" : 728,
1506880188480.json.gz:     "numFUsCrashed" : 896,
1506880190795.json.gz:     "numFUsCrashed" : 988,
1506880193122.json.gz:     "numFUsCrashed" : 1112,
1506880195245.json.gz:     "numFUsCrashed" : 1236,
1506880197928.json.gz:     "numFUsCrashed" : 1576,
1506880200238.json.gz:     "numFUsCrashed" : 1808,
1506880202451.json.gz:     "numFUsCrashed" : 2008,
1506880204663.json.gz:     "numFUsCrashed" : 2296,
1506880206769.json.gz:     "numFUsCrashed" : 2528,
1506880209366.json.gz:     "numFUsCrashed" : 2836,
1506880211751.json.gz:     "numFUsCrashed" : 3264,
1506880213952.json.gz:     "numFUsCrashed" : 3548,
1506880216142.json.gz:     "numFUsCrashed" : 3868,
1506880218508.json.gz:     "numFUsCrashed" : 4200,
1506880220859.json.gz:     "numFUsCrashed" : 4564,
1506880223051.json.gz:     "numFUsCrashed" : 4892,
1506880225208.json.gz:     "numFUsCrashed" : 5180,
1506880228037.json.gz:     "numFUsCrashed" : 5848,
1506880230348.json.gz:     "numFUsCrashed" : 6268,
1506880232708.json.gz:     "numFUsCrashed" : 6744,
1506880234844.json.gz:     "numFUsCrashed" : 7120,
1506880237043.json.gz:     "numFUsCrashed" : 7576,
1506880239970.json.gz:     "numFUsCrashed" : 8352,
1506880242318.json.gz:     "numFUsCrashed" : 8988,
1506880244596.json.gz:     "numFUsCrashed" : 9432,
1506880246822.json.gz:     "numFUsCrashed" : 9952,
1506880249235.json.gz:     "numFUsCrashed" : 10460,

Taking the difference between the last and the first snapshot in this example, one sees an increase of 10460 crashed FUs in 112 seconds (out of 51464 total FUs).

Taking 1% as a threshold (515 FUs), the first snapshot crossing it is 1506880181312.json.gz, corresponding to 19:49:41 local time.
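
For illustration, a minimal Python sketch of the check described above (not the actual DAQExpert implementation): it scans a directory of DAQAggregator snapshots, reads numFUsCrashed from each gzipped JSON file, and reports the first snapshot whose crashed-FU count exceeds 1% of the total. The snapshot directory path is a placeholder, the total-FU count is the one quoted above, and the recursive key lookup is only an assumption to cope with the field being nested somewhere inside the snapshot.

# Minimal sketch, not the DAQExpert code: find the first snapshot in which
# the crashed-FU count crosses a fractional threshold of the total FUs.
import glob
import gzip
import json
from datetime import datetime

SNAPSHOT_DIR = "/path/to/snapshots"  # hypothetical location of the *.json.gz files
TOTAL_FUS = 51464                    # total FU count quoted above
THRESHOLD_FRACTION = 0.01            # 1% threshold suggested by Hannes

def find_key(obj, key):
    """Recursively search a parsed snapshot for the first occurrence of `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_key(item, key)
            if found is not None:
                return found
    return None

for path in sorted(glob.glob(SNAPSHOT_DIR + "/*.json.gz")):
    with gzip.open(path, "rt") as snapshot_file:
        snapshot = json.load(snapshot_file)
    crashed = find_key(snapshot, "numFUsCrashed") or 0
    if crashed >= THRESHOLD_FRACTION * TOTAL_FUS:
        # the file name is the snapshot time in epoch milliseconds
        timestamp_ms = int(path.rsplit("/", 1)[-1].split(".", 1)[0])
        print("threshold first crossed at",
              datetime.fromtimestamp(timestamp_ms / 1000.0),
              "with", crashed, "crashed FUs")
        break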

smorovic commented 6 years ago

Hi @andreh12, the first HLT crash actually appeared at 19:42 (from the HLT error logs in elasticsearch/f3mon; this can be seen by reversing the sort order by HostTime in the f3mon HLT log tab). "numFUsCrashed" probably counts quarantined FUs, which appear much later, once the 5 restarts per process start getting exhausted.

To detect problems earlier, DAQ Expert would need to look at the incidence of crashes before anything gets quarantined. Currently we have a counter which is stored per run in each FU's box info file, but it isn't summarized in the resource_summary files in the ramdisk, i.e. the ones parsed by the BU application (also, there is no history of box files in elasticsearch).

What I would propose is to have this or a similar variable (e.g. the restart count divided by the maximum number of restarts, maybe provided as a fraction) in the resource_summary files (aggregated per appliance), from where the information could be propagated to the flashlists and snapshots used by the Expert.

andreh12 commented 6 years ago

Thanks @smorovic for this information -- I guess this needs some changes by @mommsen in https://github.com/mommsen/evb before we can have this in the flashlist.

Depending on how long it is expected to take until we have it in production, we could also first use numFUsCrashed and later switch to the new counter when it becomes available.

andreh12 commented 6 years ago

For the moment we propose to set the threshold at 20 or more CMSSW processes failing within a time window of 20 seconds (both values should be read from the configuration file).
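
As a rough illustration of how such a rule could be expressed (an assumption, not the DAQExpert logic-module API), the following Python sketch keeps a short history of the cumulative crash counter per snapshot and flags the condition when the count grows by at least 20 within a 20-second window; both numbers are the configurable values mentioned above, and all names here are illustrative.

# Minimal sketch of the proposed check, assuming a cumulative crash counter
# is available with every snapshot; the thresholds would come from the
# DAQExpert configuration file.
from collections import deque

class FUCrashRateCheck:
    def __init__(self, crash_threshold=20, window_seconds=20.0):
        self.crash_threshold = crash_threshold
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp in seconds, cumulative crash count)

    def update(self, timestamp, crash_count):
        """Feed one snapshot; return True if the threshold was exceeded."""
        self.history.append((timestamp, crash_count))
        # drop entries that fell out of the time window
        while timestamp - self.history[0][0] > self.window_seconds:
            self.history.popleft()
        oldest_count = self.history[0][1]
        return crash_count - oldest_count >= self.crash_threshold

The check fires only on the growth of the counter inside the window, so a constant but non-zero crash count left over from an earlier problem would not keep re-triggering it.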

smorovic commented 6 years ago

Hi @andreh12,

In elasticsearch these variables are now available; see e.g. one of the resource_summary docs filled by the BUs:

curl -XGET es-cdaq:9200/boxinfo_cdaq/resource_summary/_search?pretty -d'{"sort":{"fm_date":"desc"}}'

I will create a PHP script to provide aggregated values summed over all BUs.

smorovic commented 6 years ago

Hi,

What I mentioned is now implemented in a script. Here is the proxy link:

http://cmsdaqfff/prod/sc/php/resource_status.php?setup=cdaq (direct link from the .cms network: http://es-cdaq.cms/sc/php/resource_status.php?setup=cdaq)

On the es-cdaq hosts the script source can be accessed at /cmsnfses-web/es-web/prod/web/sc/php/resource_status.php (prod) and /cmsnfses-web/es-web/test/web/sc/php/resource_status.php (test).

The script returns its result in this form:

{
"last_run":305782,
"num_BUs_with_last_run":73,
"quarantinedRes":0,
"crashedRes":0,
"crashedOrQuarantinedRes":0,
"crashes":0,
"activeRes":9056,
"activeResOldRuns":0
}

which contains everything I mentioned ("crashes" is the 'raw' unscaled process crash count). If there has been no run in the last 10 seconds, it just returns this:

{
"last_run":-1,
"num_BUs_with_last_run":0,
"quarantinedRes":-1,
"crashedRes":-1,
"crashedOrQuarantinedRes":-1,
"crashes":-1,
"activeRes":-1,
"activeResOldRuns":-1
}

Returning actual values when there is no run would be unreliable because hltd doesn't know the BU blacklist and must detect the active BUs in the system by matching them to the latest run number.
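
For completeness, a minimal sketch (an assumption, not the actual Expert integration) of how a client could poll this endpoint and skip the no-run case signalled by the -1 values:

# Minimal sketch: poll the resource_status.php endpoint quoted above and
# ignore the counters when no run was seen in the last 10 seconds.
import json
import urllib.request

URL = "http://es-cdaq.cms/sc/php/resource_status.php?setup=cdaq"

with urllib.request.urlopen(URL, timeout=5) as response:
    status = json.load(response)

if status["last_run"] == -1:
    # hltd reported no recent run, so the counters are not meaningful
    print("no recent run, skipping the crash check")
else:
    print("run", status["last_run"],
          "raw crash count:", status["crashes"],
          "quarantined resources:", status["quarantinedRes"])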

Cheers, Srecko

gladky commented 6 years ago

Released in 2.9.17.