cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

detecting dead servers #237

Open andreh12 opened 5 years ago

andreh12 commented 5 years ago

As discussed today, we could look at the timestamps in the flashlists to check whether the PCs in the run are still alive.

For the recent case of the FRLpc, here is an example: /daqexpertflashlists/flashlists/pro/cdaq/JOB_CONTROL/2018/9/24/19/1537817871328.json.gz. The timestamp in the file name corresponds to Mon Sep 24 21:37:51 CEST 2018 (19:37:51 UTC).

This file has the following data:

    "context" : "http://frlpc40-s2d19-41-01.cms:9999",
...
    "timestamp" : "2018-09-24T18:58:37.941695Z",

which is about 39 minutes older than the file timestamp, while for another FRLpc there is:

    "context" : "http://frlpc-s1d06-07-01.cms:9999",
...
    "timestamp" : "2018-09-24T19:37:46.369696Z",

which is within about 5 seconds of the file timestamp.
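
A rough sketch of how such a check could look, using the two rows quoted above. The function names, the 60-second threshold, and the representation of the rows as a list of dicts with "context" and "timestamp" keys are assumptions made for illustration; the actual layout of the dumped JSON and a sensible threshold would need to be checked against the real files.

```python
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def file_timestamp(path):
    """UTC time encoded in the dump file name (milliseconds since the epoch)."""
    millis = int(PurePosixPath(path).name.split(".", 1)[0])
    return EPOCH + timedelta(milliseconds=millis)


def row_timestamp(text):
    """Parse a row timestamp such as '2018-09-24T18:58:37.941695Z' as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def stale_contexts(rows, file_time, max_age=timedelta(seconds=60)):
    """Map each context whose row is older than max_age to the age of that row.

    max_age is a hypothetical threshold, not a value taken from DAQExpert.
    """
    stale = {}
    for row in rows:
        age = file_time - row_timestamp(row["timestamp"])
        if age > max_age:
            stale[row["context"]] = age
    return stale


# The two JOB_CONTROL rows quoted above, reduced to the fields used here.
rows = [
    {"context": "http://frlpc40-s2d19-41-01.cms:9999",
     "timestamp": "2018-09-24T18:58:37.941695Z"},
    {"context": "http://frlpc-s1d06-07-01.cms:9999",
     "timestamp": "2018-09-24T19:37:46.369696Z"},
]
path = ("/daqexpertflashlists/flashlists/pro/cdaq/JOB_CONTROL"
        "/2018/9/24/19/1537817871328.json.gz")

file_time = file_timestamp(path)
stale = stale_contexts(rows, file_time)
for context, age in stale.items():
    print(context, "last updated", age, "before the dump")
```

With this threshold only frlpc40-s2d19-41-01 would be reported. Comparing against the file name timestamp rather than the wall clock means the same check also works on archived dumps.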


The HOST_INFO flashlist seems to be essentially empty for this time.

Alternatively, the DISK_INFO flashlist also has timestamps and shows a similar lack of updates for this FRLpc (see /daqexpertflashlists/flashlists/pro/cdaq/DISK_INFO/2018/9/24/19/1537817871328.json.gz).