cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Special instructions: corrupted data ECAL #187

Closed gladky closed 6 years ago

gladky commented 6 years ago

From shifter bulletin:

1

  • If ECAL sends corrupted data to the DAQ (DAQExpert will warn about Corrupted data received) or causes a syncloss, try to recover by
    • stopping the run; red-recycling DAQ only; starting a new run.
    • only if this does not work, try stopping the run; red-recycling DAQ & ECAL; starting a new run.
    • Make a note in e-log explaining how you recovered the problem.
    • if this happens during physics data taking, do take the above actions first but then call the ECAL DOC (70130) at any time (no need to call outside physics data taking) (Suneel Dutt, 2017-07-15)

2

  • if ECAL is in syncLoss : try a run recovery w/o recycling first, then recycling if it does not succeed (Matthieu Marionneau, 04-09-2017)
gladky commented 6 years ago

We have 2 LMs related to these instructions:

ECAL specific instructions for corrupted data

  1. Try to stop/start the run (Red recycle DAQ only)
  2. If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)
  3. Problem fixed: Make an e-log entry. Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data) to inform about the problem
  4. Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data)

ECAL specific instructions for out of sequence data received

  1. Try to stop/start the run
  2. If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)
  3. Problem fixed: Make an e-log entry. Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent out-of-sync data) to inform about the problem
  4. Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent out-of-sync data)
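
For illustration, the `{{PROBLEM-SUBSYSTEM}}` placeholder in the steps above could be filled in roughly as follows. This is a minimal sketch, not the DAQExpert implementation; `render_steps` and `PLACEHOLDER` are hypothetical names introduced here.

```python
# Hypothetical helper (not DAQExpert code): fill the subsystem placeholder
# in a list of instruction steps.
PLACEHOLDER = "{{PROBLEM-SUBSYSTEM}}"


def render_steps(steps, subsystem):
    """Return a copy of the steps with the placeholder replaced by the subsystem name."""
    return [step.replace(PLACEHOLDER, subsystem) for step in steps]


steps = [
    "Try to stop/start the run",
    "If this doesn't help: Stop the run. Red & green recycle both the DAQ "
    "and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)",
]

# Print the rendered steps as a numbered list for the ECAL case.
for number, step in enumerate(render_steps(steps, "ECAL"), start=1):
    print(f"{number}. {step}")
```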
gladky commented 6 years ago
  1. Is this instruction up to date?
  2. Shall we update the instructions for both corrupted-data and out-of-sequence to:
    1. Try to stop/start the run (Red recycle DAQ only)
    2. If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run.
    3. Problem fixed: Make an e-log entry. If this happens during physics data taking, call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data/out of sequence data) to inform about the problem
    4. Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data/out of sequence data)

note that

andreh12 commented 6 years ago

just to add my two cents on the phone numbers: in the future we could add an (external) configuration file with the map of subsystem to DOC phone numbers to show the phone numbers directly in the message (if people agree, we can open an issue for that but with low priority).
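
A rough sketch of what such a lookup might look like, assuming a JSON file that maps subsystem names to DOC phone numbers. The file format and the names `load_doc_phones` and `call_doc_message` are assumptions for illustration; only the ECAL DOC number (70130) is taken from the bulletin quoted in the first comment.

```python
import json

# Hypothetical contents of the external configuration file.
# Only the ECAL entry comes from the shifter bulletin quoted in this issue.
CONFIG_TEXT = '{"ECAL": "70130"}'


def load_doc_phones(text):
    """Parse the subsystem -> DOC phone number map from JSON text."""
    return json.loads(text)


def call_doc_message(subsystem, phones):
    """Build the 'call the DOC' message, adding the phone number when it is known."""
    phone = phones.get(subsystem)
    if phone is None:
        # No entry in the map: fall back to the message without a number.
        return f"Call the DOC of {subsystem}"
    return f"Call the DOC of {subsystem} ({phone})"


phones = load_doc_phones(CONFIG_TEXT)
print(call_doc_message("ECAL", phones))
```

Falling back to the plain message when a subsystem has no entry keeps the instructions usable even if the map is incomplete.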

gladky commented 6 years ago

I found another special instruction related to ECAL (labeled as 2 in the first comment). It seems to conflict with the first one.

Unless "try a run recovery w/o recycling first" == "recycle DAQ only"

gladky commented 6 years ago

Notes from @hsakulin

We should avoid executing the unnecessary recovery step of red-recycling DAQ where possible.

gladky commented 6 years ago

Contacted ECAL, for reference:

Hello Giacomo

Ecal is currently the only subsystem that requires red-recycle of DAQ subsystem in the special recovery instructions in case of syncloss problems.

Note that the DAQ subsystem generally does not require a red-recycle from RunBlocked, which is the case in syncloss problems. However, it does require a red-recycle from the Error state, which is the case for corrupted data received problems.

In the shifter bulletin board I found:

If ECAL sends corrupted data to the DAQ (DAQExpert will warn about Corrupted data received) or causes a syncloss, try to recover by stopping the run; red-recycling DAQ only; starting a new run.

Is there a reason why you recommend doing the Red recycle of the DAQ subsystem for syncloss problems?

Additionally, could you please review the special instructions from the bulletin board for ECAL? We've extracted them to a GitHub issue:

https://github.com/cmsdaq/DAQExpert/issues/187

Reply

I think it is related to an issue we saw while we were testing the new SLinks.

giacomoCucciati commented 6 years ago

As reported also in the email thread, we have new instructions in case of syncloss:

gladky commented 6 years ago

In this case we will have the following instructions.

ECAL corrupted data received

  1. Try to stop/start the run (Red recycle DAQ only)
  2. If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem ECAL. Start new Run.
  3. Problem fixed: Make an e-log entry. If this happens during physics data taking, call the DOC of ECAL (subsystem that sent corrupted data) to inform about the problem
  4. Problem not fixed: Call the DOC of ECAL (subsystem that sent corrupted data)

ECAL syncloss

  1. Stop/start the run
  2. If this doesn't help: Stop the run. Red recycle the subsystem ECAL. Start new Run.
  3. In the meanwhile call ECAL DOC
  4. Problem not fixed: Call the DOC of ECAL

Note that:

@giacomoCucciati after you confirm the final form I will introduce these changes to the expert system and move these instructions to a new section in the bulletin board, "Covered by DAQExpert"

giacomoCucciati commented 6 years ago

The instructions are ok. Yes, point 3 can be improved, and I would also add this information:

  1. call ECAL DOC during the Red Recycle only if beam is not in RAMP mode
gladky commented 6 years ago

Final, confirmed version of ECAL special instructions:

ECAL corrupted data received

  1. Try to stop/start the run (Red recycle DAQ only)
  2. If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem ECAL. Start new Run.
  3. Problem fixed: Make an e-log entry. If this happens during physics data taking, call the DOC of ECAL (subsystem that sent corrupted data) to inform about the problem
  4. Problem not fixed: Call the DOC of ECAL (subsystem that sent corrupted data)

ECAL syncloss

  1. Stop/start the run
  2. If this doesn't help: Stop the run. Red recycle the subsystem ECAL. Start new Run.
  3. Call ECAL DOC during the Red Recycle (only if beam is not in RAMP mode)
  4. Problem not fixed: Call the DOC of ECAL
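
As a sketch only, the two confirmed instruction sets could be encoded as data keyed by problem type. The `Step` class, the `skip_in_ramp` flag, and `steps_for` are hypothetical and do not reflect DAQExpert's actual data model. Dropping the syncloss DOC call while the beam is in RAMP mode is one reading of the "only if beam is not in RAMP mode" condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    text: str
    # Assumption: a step flagged here is simply omitted while the beam is in RAMP mode.
    skip_in_ramp: bool = False


# Hypothetical encoding of the confirmed ECAL instructions listed above.
ECAL_INSTRUCTIONS = {
    "corrupted data received": [
        Step("Try to stop/start the run (Red recycle DAQ only)"),
        Step("If this doesn't help: Stop the run. Red & green recycle both "
             "the DAQ and the subsystem ECAL. Start new Run."),
        Step("Problem fixed: Make an e-log entry. If this happens during physics "
             "data taking, call the DOC of ECAL to inform about the problem"),
        Step("Problem not fixed: Call the DOC of ECAL"),
    ],
    "syncloss": [
        Step("Stop/start the run"),
        Step("If this doesn't help: Stop the run. Red recycle the subsystem ECAL. "
             "Start new Run."),
        Step("Call ECAL DOC during the Red Recycle", skip_in_ramp=True),
        Step("Problem not fixed: Call the DOC of ECAL"),
    ],
}


def steps_for(problem, beam_in_ramp=False):
    """Return the step texts for a problem, omitting RAMP-sensitive steps if needed."""
    return [
        step.text
        for step in ECAL_INSTRUCTIONS[problem]
        if not (beam_in_ramp and step.skip_in_ramp)
    ]


for number, text in enumerate(steps_for("syncloss", beam_in_ramp=True), start=1):
    print(f"{number}. {text}")
```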
gladky commented 6 years ago

Included in 2.13.0, moved to section "covered by expert"