cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Lengthy fixing-soft-error - TRACKER wants different threshold #223

Open gladky opened 6 years ago

gladky commented 6 years ago

From Elog (LOUIS JEAN MOUREAUX)

In case TRACKER appears to be stuck in the FixingSoftError state, DAQExpert will warn the DAQ shifter after 30 seconds, telling him/her to call the Tracker DOC.

Just discussed with the Tracker DOC:

- The delay is too short; one should wait for more than 1 min.
- Red-recycling the Tracker fixes the problem.
- There is no need to call the DOC every time.

Remi:

I think we should change the instructions ASAP. There have been multiple complaints in the elog.

gladky commented 6 years ago

Currently we have a single global threshold, defined in the expert properties file:

expert.logic.lenghtyfixingsofterror.threshold.period = 30000

One solution is to introduce a TRACKER-specific threshold:

expert.logic.lenghtyfixingsofterror.threshold.period.tracker = 60000

Another is to raise the threshold for everyone, if you think that's appropriate. Please let me know what you think.
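
To make the first option concrete, here is a minimal sketch of how a subsystem-specific threshold could fall back to the global one. It is not the actual DAQExpert code: the class `SoftErrorThresholds`, the methods `thresholdFor` and `isLengthy`, and the convention of appending the lower-cased subsystem name to the key are assumptions for illustration; only the two property keys come from the proposal above. `ECAL` is used purely as an example of a subsystem without an override.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper, not part of DAQExpert: resolves the
// "lengthy FixingSoftError" threshold per subsystem.
final class SoftErrorThresholds {

    // Global key as it appears in the expert properties file (spelling kept as-is).
    static final String BASE_KEY = "expert.logic.lenghtyfixingsofterror.threshold.period";

    // Returns the threshold in milliseconds for the given subsystem.
    // Assumed convention: an override lives under BASE_KEY + "." + lower-cased subsystem name;
    // if no override is present, the global value is used.
    public static long thresholdFor(Properties props, String subsystem) {
        String global = props.getProperty(BASE_KEY);
        if (global == null) {
            throw new IllegalStateException("Missing property: " + BASE_KEY);
        }
        String specific = props.getProperty(BASE_KEY + "." + subsystem.toLowerCase(Locale.ROOT));
        return Long.parseLong((specific != null ? specific : global).trim());
    }

    // True if the subsystem has been in FixingSoftError longer than its threshold.
    public static boolean isLengthy(Properties props, String subsystem, long elapsedMs) {
        return elapsedMs > thresholdFor(props, subsystem);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(BASE_KEY, "30000");
        props.setProperty(BASE_KEY + ".tracker", "60000");

        System.out.println(thresholdFor(props, "TRACKER"));     // 60000 (override)
        System.out.println(thresholdFor(props, "ECAL"));        // 30000 (global fallback)
        System.out.println(isLengthy(props, "TRACKER", 45000)); // false: 45 s is below 60 s
    }
}
```

With this kind of fallback, subsystems that do not ask for a different value need no extra configuration, and raising the global threshold remains a one-line change.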

andreh12 commented 6 years ago

I'm slightly inclined towards having subsystem-specific thresholds (as needed) even though it complicates things a bit.

gladky commented 6 years ago

@erikbutz could you please confirm this request and the proposed threshold? I will then include it in the next release.

erikbutz commented 6 years ago

A threshold of 60 or 70 seconds would indeed be preferable. We have slow control readings that access the control token rings during running, and if the thread for this blocks the access, the start of the soft error recovery will have to wait for it to finish.

In principle the magnitude of the problem is low (we have had almost 3000 soft error recoveries since 2016 and only some 20 took more than 30 seconds), but we are taking a look at the recent spill of longer recoveries.