detect when a subsystem goes repeatedly into SoftErrorRecovery (ContinouslySoftError LM)

cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator


detect when a subsystem goes repeatedly into SoftErrorRecovery (ContinouslySoftError LM) #65

Closed andreh12 closed 6 years ago

andreh12 commented 7 years ago

e.g. not more than 3 transitions into SoftErrorRecovery per subsystem in 5 minutes.

Counters should be kept per subsystem; the information could be stored as timestamps in a queue of length 3. This implies that a single snapshot is not sufficient to run tests with the corresponding code.
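A minimal sketch of the queue idea, assuming snapshots are fed one at a time as (subsystem, state, timestamp in seconds). The class name `SoftErrorLoopDetector`, the `update` method and the state string comparison are hypothetical and not taken from the DAQExpert code. The detector keeps the last 3 transition timestamps per subsystem and fires on a new transition when the oldest stored one is still within the 5-minute window, i.e. on the 4th transition in 5 minutes.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60   # threshold window from the issue text
MAX_TRANSITIONS = 3       # "not more than 3 transitions" per window


class SoftErrorLoopDetector:
    """Hypothetical per-subsystem detector for repeated SoftErrorRecovery."""

    def __init__(self, max_transitions=MAX_TRANSITIONS,
                 window_seconds=WINDOW_SECONDS):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        # One bounded queue of transition timestamps per subsystem.
        self._recent = defaultdict(lambda: deque(maxlen=max_transitions))
        # Last seen "in recovery" flag, needed to detect a transition.
        self._in_recovery = {}

    def update(self, subsystem, state, timestamp):
        """Feed one snapshot; return True if the threshold is exceeded."""
        was_in = self._in_recovery.get(subsystem, False)
        now_in = state == "SoftErrorRecovery"
        self._in_recovery[subsystem] = now_in
        if not now_in or was_in:
            return False  # no new transition into SoftErrorRecovery
        recent = self._recent[subsystem]
        # The queue is full and its oldest entry is still inside the window:
        # this transition is one more than allowed.
        exceeded = (len(recent) == self.max_transitions
                    and timestamp - recent[0] <= self.window_seconds)
        recent.append(timestamp)
        return exceeded
```

Because the result depends on earlier transitions, a test has to replay a sequence of snapshots rather than a single one, which matches the remark above.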

Currently the DAQAggregator may miss the state transition, since the Level0 may fix the SoftErrorRecovery before the next snapshot is taken. A potential solution would require the Level0 to export, to a flashlist, counters of how often each subsystem went into SoftErrorRecovery during the current run.
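If such counters were exported, the number of transitions between two snapshots could be recovered from the difference of the cumulative values, even when the state itself was never observed. The sketch below assumes each snapshot carries a mapping from subsystem name to its cumulative count; that mapping, the function name `new_transitions` and the reset handling are assumptions, since the flashlist field is only proposed here.

```python
def new_transitions(previous, current):
    """Return, per subsystem, the transitions seen since the previous snapshot.

    `previous` and `current` are hypothetical dicts mapping a subsystem name
    to its cumulative SoftErrorRecovery count for the current run.
    """
    deltas = {}
    for subsystem, count in current.items():
        before = previous.get(subsystem, 0)
        if count >= before:
            deltas[subsystem] = count - before
        else:
            # Counter went down: assume it was reset at the start of a new run.
            deltas[subsystem] = count
    return deltas
```

For example, `new_transitions({"ES": 2}, {"ES": 5})` reports three transitions for ES even if none of them coincided with a snapshot.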

gladky commented 7 years ago

Implemented in branch soft-error-count. We need to come up with recovery suggestions.

Example test case: http://daq-expert-dev.cms/DAQExpert/?start=2017-06-12T09:34:35.179Z&end=2017-06-12T09:40:26.099Z

andreh12 commented 7 years ago

our instructions on the twiki (CMS/ShiftNews) say: A single soft error recovery or a few in a row can be normal. You do not need to call the DOC about it. If a detector requests soft error recovery in an endless loop, the detector's DOC should be called.

for ES specifically, the instructions have:

ES may ask for consecutive SoftErrorRecovery every ~10 seconds. What to do:

1) Stop the run and re-start it.
2) If 1) doesn't work and DAQ is in the same condition as before, stop the run and red-recycle ES.

andreh12 commented 7 years ago

for Pixel, the instructions say:

When Pixel goes repeatedly into SoftErrorRecovery, check DCS. If there is a problem in DCS (sectors turned off), ask the DCS shifter to call the Pixel DOC. If there is no problem in DCS, call the Pixel DOC immediately.

gladky commented 7 years ago

Thank you @andreh12 for these suggestions. In expert it will be implemented as conditional instructions. Here is the expert-ready version:

Default

Call DOC of subsystem {{SUBSYSTEM}}

ES subsystem

1) Stop the run and re-start it.
2) If 1) doesn't work and DAQ is in the same condition as before, stop the run and red-recycle ES.

Pixel subsystem:

1) Check DCS.
2) If problem in DCS (sectors turned off), ask DCS shifter to call Pixel DOC.
3) If no problem in DCS, call Pixel DOC immediately.

Tracker subsystem:

1) Keep an eye on the situation.
2) If soft error recovery continues, try to stop the run, red-recycle the tracker, and start a new run.
3) If not fixed: call the tracker DOC.
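The conditional instructions above could be represented as a lookup table with a default entry. This is only an illustration of the idea, not how DAQExpert stores its actions; the subsystem keys (`"ES"`, `"PIXEL"`, `"TRACKER"`) and the function name `recovery_steps` are assumptions, and the step texts are copied from this thread.

```python
# Fallback used for every subsystem without a dedicated entry.
DEFAULT_STEPS = ["Call DOC of subsystem {{SUBSYSTEM}}"]

# Subsystem-specific steps; the keys are assumed identifiers.
STEPS_BY_SUBSYSTEM = {
    "ES": [
        "Stop the run and re-start it.",
        "If 1) doesn't work and DAQ is in the same condition as before, "
        "stop the run and red-recycle ES.",
    ],
    "PIXEL": [
        "Check DCS.",
        "If problem in DCS (sectors turned off), ask DCS shifter to call "
        "Pixel DOC.",
        "If no problem in DCS, call Pixel DOC immediately.",
    ],
    "TRACKER": [
        "Keep an eye on the situation.",
        "If soft error recovery continues, try to stop the run, "
        "red-recycle the tracker, and start a new run.",
        "If not fixed: call the tracker DOC.",
    ],
}


def recovery_steps(subsystem):
    """Return the steps for a subsystem, filling in the placeholder."""
    steps = STEPS_BY_SUBSYSTEM.get(subsystem, DEFAULT_STEPS)
    return [step.replace("{{SUBSYSTEM}}", subsystem) for step in steps]
```

With this table, `recovery_steps("ECAL")` falls back to the default and returns `["Call DOC of subsystem ECAL"]`.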

gladky commented 7 years ago

This is blocked by #3 and #64

The LM implementing this issue and the 2 other LMs related to fixing soft errors (#3 and #64) have been implemented since mid-June. They are not yet released because suggestions for the two other cases are still missing.

implemented on the branch feature/soft-error-count

andreh12 commented 7 years ago

The shifter instructions (typically provided by the subsystem experts themselves) have statements about SoftErrorRecovery loops, as you implemented for #65. However, I can't find any instructions on what to do when SoftErrorRecovery takes too long or leads to a stuck state.

I would propose the following actions for both until we hear otherwise from the subsystem DOCs:

Ask the DCS shifter to check the status of subsystem {{SUBSYSTEM}}
If the DCS status of {{SUBSYSTEM}} is as expected, call the DOC of subsystem {{SUBSYSTEM}}

@mommsen, can you please comment on this proposal, or upvote this comment if you agree?

mommsen commented 7 years ago

I would call the DOC of the subsystem immediately. If there is indeed a DCS problem, most likely the subsystem DOC is called anyway. If you want to foster the communication in the control room, one could say e.g. "Call the DOC of subsystem {{SUBSYSTEM}}, and ask the DCS shifter to check the status of subsystem {{SUBSYSTEM}}".