cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Automatic recovery #169

Open gladky opened 6 years ago

gladky commented 6 years ago

Expert can automatically recover using L0 Automator and L0 FMs.

gladky commented 6 years ago

In the first version, the 2 most frequent conditions will be automated:

This gives us the best chance of seeing the expert's automatic recovery actions in the coming days.

gladky commented 6 years ago

Please confirm that after translating the suggestions from manual to automatic everything is in order:

Note that a markup syntax has been introduced in the action steps. The expert interprets marked-up text as something that can be automated and transforms it into a readable form for display. The benefit of this approach is that the human-readable and the executable action definitions are kept in one place, which avoids synchronisation problems.
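To illustrate the idea of one definition serving both purposes, here is a minimal sketch of how a marked-up step could be split into an executable part and a readable part. It is not DAQExpert's actual code: the function names (`fill`, `extract_actions`, `to_readable`) are hypothetical, and rendering an action name by splitting its CamelCase words is only an assumption about what the displayed form might look like.

```python
import re

# Matches <<Action>> or <<Action::argument>> as used in the action steps.
ACTION = re.compile(r"<<(\w+)(?:::(.+?))?>>")
# Matches {{NAME}} placeholders such as {{SUBSYSTEM}} or {{PROBLEM-SUBSYSTEM}}.
PLACEHOLDER = re.compile(r"\{\{([\w-]+)\}\}")


def fill(step, context):
    """Replace {{NAME}} placeholders from context; unknown names are left untouched."""
    return PLACEHOLDER.sub(lambda m: context.get(m.group(1), m.group(0)), step)


def extract_actions(step):
    """Executable view (hypothetical): a list of (action, argument) pairs."""
    return [(m.group(1), m.group(2)) for m in ACTION.finditer(step)]


def to_readable(step):
    """Readable view (assumed format): markup removed, CamelCase split into words."""
    def render(m):
        words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1))
        return words if m.group(2) is None else f"{words} {m.group(2)}"
    return ACTION.sub(render, step)


step = "<<StopAndStartTheRun>> with <<RedRecycle::{{SUBSYSTEM}}>> & <<GreenRecycle::{{SUBSYSTEM}}>>"
filled = fill(step, {"SUBSYSTEM": "ECAL"})
actions = extract_actions(filled)
readable = to_readable(filled)
```

Both views are derived from the same string, so a change to the step text cannot leave the displayed instruction and the executed action out of sync.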

FED stuck current recovery suggestion:

1. Stop the run
2. Red & green recycle the subsystem {{SUBSYSTEM}}.
3. Start new run (try up to 2 times)
4. Problem fixed: Make an e-log entry. Call the DOC of the subsystem {{SUBSYSTEM}} to inform
5. Problem not fixed: Call the DOC for the subsystem {{SUBSYSTEM}}

FED stuck automated recovery procedure:

1. Try the following up to 2 times:
2. <<StopAndStartTheRun>> with <<RedRecycle::{{SUBSYSTEM}}>> and <<GreenRecycle::{{SUBSYSTEM}}>>
3. Same as steps 4 and 5 of the current suggestion

The specific ECAL and TRACKER cases have slight modifications, but nothing worth mentioning explicitly here.
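The "try up to 2 times" logic of the automated FED stuck procedure above can be sketched as a bounded retry loop. This is only an illustration under stated assumptions: `recover_with_retries`, the `attempt` and `is_fixed` callbacks, and the simulated state are hypothetical stand-ins, not the interface of the L0 Automator or of the expert.

```python
from typing import Callable, List


def recover_with_retries(attempt: Callable[[], None],
                         is_fixed: Callable[[], bool],
                         max_attempts: int = 2) -> bool:
    """Run the recovery attempt up to max_attempts times; stop once the problem is gone."""
    for _ in range(max_attempts):
        attempt()
        if is_fixed():
            return True
    return False


# Simulated example: the stuck FED clears after the second attempt.
log: List[str] = []
state = {"attempts": 0}


def stop_recycle_start() -> None:
    # Stand-in for: stop the run, red & green recycle the subsystem, start a new run.
    state["attempts"] += 1
    log.append(f"attempt {state['attempts']}: stop run, red & green recycle ECAL, start run")


fixed = recover_with_retries(stop_recycle_start, lambda: state["attempts"] >= 2)

# The follow-up remains a manual step, as in the current suggestion.
follow_up = ("Make an e-log entry and inform the DOC of ECAL" if fixed
             else "Call the DOC of ECAL")
```

Returning a boolean keeps the automated part separate from the follow-up, which still depends on whether the problem was fixed.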

Out of sequence data received current recovery suggestion:

1. Try to recover (try up to 2 times)
2. Stop the run. Red & green recycle the subsystem {{PROBLEM-SUBSYSTEM}}. Start a new Run
3. Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that caused the SyncLoss)
4. Problem fixed: Make an e-log entry. Call the DOC {{PROBLEM-SUBSYSTEM}} (subsystem that caused the SyncLoss) to inform about the problem

Out of sequence data received automated recovery procedure:

1. Try to recover (try up to 2 times)
2. <<StopAndStartTheRun>> with <<RedRecycle::{{PROBLEM-SUBSYSTEM}}>> & <<GreenRecycle::{{PROBLEM-SUBSYSTEM}}>>
3. Same as steps 3 and 4 of the current suggestion

The specific ECAL, TRACKER, and FED 1111 cases have slight modifications, but nothing worth mentioning explicitly here.

gladky commented 6 years ago

[Screenshot, 2018-04-03 at 10:28:52]

mommsen commented 6 years ago

Do we really want the shifter to initiate each recovery step individually? It might be fine for the first test, but I think in the long run we'd want just one button that executes all necessary steps. In any case, I think it should be made clear which steps have already been executed, e.g. by shading the corresponding steps.

BTW: I think there is still no way to automatically create elog entries, is there? We have to come up with a way to keep track of the recovery actions in the elog, at least once they become fully automatic.
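As a rough sketch of the two points above (one trigger for all steps, plus a record of what was executed), the steps could carry an `executed` flag and append to a plain-text record as they run. Everything here is hypothetical: the `Step` and `RecoveryJob` classes are illustrative only, and no elog API is assumed. The summary is just text that someone or something would still have to submit.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]
    executed: bool = False  # a UI could shade steps where this is True


@dataclass
class RecoveryJob:
    steps: List[Step]
    record: List[str] = field(default_factory=list)

    def run_all(self) -> None:
        """Execute every pending step in order, marking each as done and recording it."""
        for step in self.steps:
            if step.executed:
                continue
            step.action()
            step.executed = True
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            self.record.append(f"{stamp} executed: {step.description}")

    def summary(self) -> str:
        """Plain text that could go into an e-log entry."""
        return "\n".join(self.record)


performed: List[str] = []
job = RecoveryJob([
    Step("Stop the run", lambda: performed.append("stop")),
    Step("Red & green recycle ECAL", lambda: performed.append("recycle")),
    Step("Start new run", lambda: performed.append("start")),
])
job.run_all()
```

Calling `run_all()` a second time does nothing, because steps that are already marked as executed are skipped.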