andreh12 closed this issue 6 years ago
Implemented in branch soft-error-count. We need to come up with recovery suggestions.
Example test case: http://daq-expert-dev.cms/DAQExpert/?start=2017-06-12T09:34:35.179Z&end=2017-06-12T09:40:26.099Z
our instructions on the twiki (CMS/ShiftNews) say: A single soft error recovery or a few in a row can be normal. You do not need to call the DOC about it. If a detector requests soft error recovery in an endless loop, the detector's DOC should be called.
for ES specifically, the instructions have:
ES may ask for consecutive SoftErrorRecovery every ~10 seconds. What to do:
1) Stop the run and re-start it.
2) If 1) doesn't work and DAQ is in the same condition as before, stop the run and red-recycle ES.
for Pixel, the instructions say:
when pixel goes repeatedly in SoftErrorRecovery check DCS. If problem in DCS (sectors turned off) ask DCS shifter to call Pixel DOC. If no problem in DCS call Pixel DOC immediately.
Thank you @andreh12 for these suggestions. In expert it will be implemented as conditional instructions. Here is the expert-ready version:
Default
ES subsystem:
1) Stop the run and re-start it.
2) If 1) doesn't work and DAQ is in the same condition as before, stop the run and red-recycle ES.
Pixel subsystem:
1) Check DCS.
2) If there is a problem in DCS (sectors turned off), ask the DCS shifter to call the Pixel DOC.
3) If there is no problem in DCS, call the Pixel DOC immediately.
Tracker subsystem:
1) Keep an eye on the situation.
2) If soft error recovery continues, try to stop the run, red-recycle the tracker, and start a new run.
3) If not fixed: call the tracker DOC.
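For illustration, the conditional instructions above could be held in a simple lookup keyed by subsystem. This is only a sketch, not the DAQExpert implementation: the names `RECOVERY_STEPS`, `DEFAULT_STEPS` and `recovery_steps` are hypothetical, and the default text is an assumption taken from the general twiki guidance quoted earlier in this thread.

```python
# Hypothetical lookup of recovery steps per subsystem.
# Step texts are taken from the instructions in this thread.
RECOVERY_STEPS = {
    "ES": [
        "Stop the run and re-start it",
        "If that doesn't work and DAQ is in the same condition as before, "
        "stop the run and red-recycle ES",
    ],
    "PIXEL": [
        "Check DCS",
        "If problem in DCS (sectors turned off) ask DCS shifter to call Pixel DOC",
        "If no problem in DCS call Pixel DOC immediately",
    ],
    "TRACKER": [
        "Keep an eye on the situation",
        "If soft error recovery continues try to stop the run, "
        "red recycle the tracker, start a new run",
        "If not fixed: call the tracker DOC",
    ],
}

# Assumption: the default follows the general twiki statement that the
# detector's DOC should be called when recovery is requested in a loop.
DEFAULT_STEPS = [
    "If the subsystem keeps requesting soft error recovery, call its DOC",
]


def recovery_steps(subsystem):
    """Return the steps for a subsystem, falling back to the default."""
    return RECOVERY_STEPS.get(subsystem.upper(), DEFAULT_STEPS)
```

A subsystem without its own entry simply gets the default list, so further special cases can be added without touching the lookup function.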
This is blocked by #3 and #64
The LM for this issue and for the two related soft-error issues (#3 and #64) has been implemented since mid-June. It has not been released yet because recovery suggestions for the two other cases are still missing.
implemented on the branch feature/soft-error-count
The shifter instructions have statements about SoftErrorRecovery loops (as you implemented for #65), but there seem to be no instructions (typically provided by the subsystem experts themselves) for the other two cases: I can't find any instructions on what to do when SoftErrorRecovery takes too long or leads to a stuck state.
I would propose the following actions for both until we hear otherwise from the subsystem DOCs:
@mommsen , can you please comment on this proposal or upvote this comment if you agree ?
I would call the DOC of the subsystem immediately. If there is indeed a DCS problem, most likely the subsystem DOC is called anyway. If you want to foster the communication in the control room, one could say e.g. Call the DOC of subsystem {{SUBSYSTEM}}, and ask the DCS shifter to check the status of subsystem {{SUBSYSTEM}}
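As a sketch of how such a message with a `{{SUBSYSTEM}}` placeholder might be filled in: the `render` helper below is hypothetical and only does plain string substitution; how the expert actually resolves placeholders is not specified in this thread.

```python
# Message text as proposed in the comment above.
TEMPLATE = (
    "Call the DOC of subsystem {{SUBSYSTEM}}, and ask the DCS shifter "
    "to check the status of subsystem {{SUBSYSTEM}}"
)


def render(template, subsystem):
    """Replace every {{SUBSYSTEM}} placeholder with the subsystem name."""
    return template.replace("{{SUBSYSTEM}}", subsystem)
```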
e.g. not more than 3 transitions into SoftErrorRecovery per subsystem in 5 minutes. Counters should be kept per subsystem; information could be stored as timestamps in a queue of length 3. This implies that a single snapshot is not sufficient to run tests with the corresponding code.
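The per-subsystem queue of timestamps described above could look roughly like this. It is a minimal sketch, assuming timestamps in seconds and the example limits from this comment (3 transitions, 5 minutes); the class name `SoftErrorCounter` and its `record` method are hypothetical and not taken from the DAQExpert code.

```python
from collections import defaultdict, deque

MAX_TRANSITIONS = 3      # allowed transitions per window (example value)
WINDOW_SECONDS = 5 * 60  # window length in seconds (example value)


class SoftErrorCounter:
    """Tracks recent SoftErrorRecovery transitions per subsystem."""

    def __init__(self, max_transitions=MAX_TRANSITIONS, window=WINDOW_SECONDS):
        self.window = window
        # One bounded queue per subsystem; the oldest timestamp is
        # dropped automatically once the queue is full.
        self._queues = defaultdict(lambda: deque(maxlen=max_transitions))

    def record(self, subsystem, timestamp):
        """Record a transition; return True if the limit is exceeded.

        The limit is exceeded when the queue already holds the maximum
        number of transitions and the oldest of them is still inside
        the window ending at the new timestamp.
        """
        queue = self._queues[subsystem]
        exceeded = (
            len(queue) == queue.maxlen
            and timestamp - queue[0] <= self.window
        )
        queue.append(timestamp)
        return exceeded
```

Because the decision depends on earlier transitions, a test has to feed a sequence of timestamps (several snapshots) rather than a single one.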
Currently the DAQAggregator may miss the state transition, since the Level0 may fix the SoftErrorRecovery before the next snapshot is taken. A potential solution would require the Level0 to export to a flashlist counters of how often each subsystem went into SoftErrorRecovery during the current run.
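If such cumulative counters were available, transitions missed between snapshots could be recovered by comparing consecutive counter values. The sketch below assumes each snapshot carries a mapping from subsystem name to a cumulative count; the function name and the data layout are hypothetical, and treating a decreasing counter as a reset at the start of a new run is also an assumption.

```python
def new_transitions(previous, current):
    """Return how many transitions each subsystem made between snapshots.

    `previous` and `current` map subsystem names to cumulative counts
    of SoftErrorRecovery transitions in the current run.
    """
    deltas = {}
    for subsystem, count in current.items():
        before = previous.get(subsystem, 0)
        if count < before:
            # Assumption: a smaller value means the counter was reset
            # (new run), so all counted transitions are new.
            delta = count
        else:
            delta = count - before
        if delta > 0:
            deltas[subsystem] = delta
    return deltas
```

With this approach a transition that starts and ends between two snapshots still shows up as an increment, even though the SoftErrorRecovery state itself was never observed.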