cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Add hold-off before reporting high CPU load #186

Closed mommsen closed 6 years ago

mommsen commented 6 years ago

There have been several reports that the DAQExpert warns about a too high CPU load, e.g.:

HLT CPU load    2018-05-07 08:40:41    23 s
HLT CPU load is high (last: 92.9 %, avg: 96.6 %, min: 92.9 %, max: 99.6 %), which exceeds the threshold of 90.0 %.

    Call the HLT DOC, mentioning the HLT CPU load is high.

Called the HLT DOC -> short-term high load at beginning of run is expected

It is indeed expected that the CPU load is high at the start of a run during stable beams. I would suggest adding a hold-off of 1 minute before reporting the high CPU load.

Remi

andreh12 commented 6 years ago

As discussed with @gladky, we will implement this by keeping track of the timestamp when RunOngoing last went from satisfied() = false to satisfied() = true in the HltCpuLoad module.
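A minimal sketch of the idea (not DAQExpert's actual code; the class and method names below are hypothetical): remember the timestamp of the rising edge of the parent condition, and suppress reporting until the hold-off has elapsed.

```java
// Hypothetical hold-off helper: tracks when a parent condition (e.g. RunOngoing)
// last went from unsatisfied to satisfied, and reports whether the configured
// hold-off period has elapsed since that transition.
public class HoldOff {
    private final long holdOffMs;
    private long satisfiedSince = -1; // -1 means "not currently satisfied"
    private boolean previous = false;

    public HoldOff(long holdOffMs) {
        this.holdOffMs = holdOffMs;
    }

    /** Call on every update with the parent condition's state and a timestamp. */
    public void update(boolean satisfied, long nowMs) {
        if (satisfied && !previous) {
            satisfiedSince = nowMs; // rising edge: remember when it started
        } else if (!satisfied) {
            satisfiedSince = -1;    // condition dropped: reset the timer
        }
        previous = satisfied;
    }

    /** True once the condition has stayed satisfied longer than the hold-off. */
    public boolean expired(long nowMs) {
        return satisfiedSince >= 0 && nowMs - satisfiedSince >= holdOffMs;
    }
}
```

The high-CPU-load check would then fire only when `expired(now)` is true, so a load spike in the first minute of a run is ignored.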

andreh12 commented 6 years ago

@mommsen is it OK to introduce a generic hold-off period counted from the beginning of the run, or do you think we should apply the hold-off only during stable beams?

mommsen commented 6 years ago

I think it is good enough to have a hold-off of 1 minute in any case. A short spike in CPU load does not cause any issues.

andreh12 commented 6 years ago

Thanks @mommsen for confirming.

One minute, on the other hand, might be optimistic; I've looked at a few recent cases:

But this is something one can easily tune later.

andreh12 commented 6 years ago

added to release 2.10.6

andreh12 commented 6 years ago

with @mommsen we agreed to use expert.logic.hlt.cpu.load.holdoff.period = 180000 (3 minutes) in production

(added to the production configuration file but commented out for the moment)
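For illustration, the entry in the production configuration file might look like this (a sketch assuming a standard Java `.properties` format; the exact file layout is an assumption, only the key and value come from the discussion above):

```properties
# Hold-off before reporting high HLT CPU load, in milliseconds.
# 180000 ms = 3 minutes, as agreed with @mommsen.
# Commented out until enabled in production:
#expert.logic.hlt.cpu.load.holdoff.period = 180000
```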

gladky commented 6 years ago

Reopening to include @hsakulin's suggestion:

We then need two hold-offs:

e.g. 5 minutes from the start of the run, and 2 minutes from the start of the condition.
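The two hold-offs can be combined by requiring both to have elapsed before the alarm fires. A minimal sketch (hypothetical names and values, illustrating the 5-minute and 2-minute figures from the suggestion):

```java
// Hypothetical sketch: suppress the high-load alarm until BOTH the run and the
// high-load condition itself have lasted past their respective hold-offs.
public class TwoHoldOffs {
    static final long RUN_HOLDOFF_MS = 5 * 60 * 1000;       // 5 min from run start
    static final long CONDITION_HOLDOFF_MS = 2 * 60 * 1000; // 2 min from condition start

    /** True if the high-CPU-load alarm should fire at time nowMs. */
    public static boolean shouldReport(long runStartMs, long conditionStartMs, long nowMs) {
        return nowMs - runStartMs >= RUN_HOLDOFF_MS
            && nowMs - conditionStartMs >= CONDITION_HOLDOFF_MS;
    }
}
```

This way a spike at the start of the run is ignored by the run hold-off, while a genuine sustained overload later in the run is still delayed by only the shorter condition hold-off.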

gladky commented 6 years ago

Fixed in 0914f77, released as 2.10.8