Special instructions: corrupted data ECAL

gladky commented 6 years ago

From shifter bulletin:

1

If ECAL sends corrupted data to the DAQ (DAQExpert will warn about Corrupted data received) or causes a syncloss, try to recover by

stopping the run; red-recycling DAQ only; starting a new run.

Only if this does not work, try stopping the run; red-recycling DAQ & ECAL; starting a new run.

Make a note in e-log explaining how you recovered the problem.

If this happens during physics data taking, take the above actions first, then call the ECAL DOC (70130) at any time; there is no need to call outside physics data taking. (Suneel Dutt, 2017-07-15)

2

If ECAL is in syncLoss: try a run recovery w/o recycling first, then with recycling if it does not succeed (Matthieu Marionneau, 04-09-2017)

gladky commented 6 years ago

We have 2 LMs related to these instructions:

OOS data received (+ legacy FC1)
Corrupted data received (+ legacy FC2)

ECAL specific instructions for corrupted data

Try to stop/start the run (Red recycle DAQ only)

If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)

Problem fixed: Make an e-log entry. Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data) to inform about the problem

Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data)

ECAL specific instructions for out of sequence data received

Try to stop/start the run

If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)

Problem fixed: Make an e-log entry. Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent out-of-sync data) to inform about the problem

Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent out-of-sync data)
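The `{{PROBLEM-SUBSYSTEM}}` placeholder in the steps above is presumably filled in with the name of the offending subsystem before the message is shown. A minimal sketch of that substitution follows, written in Python for illustration only; `render_steps` and `OUT_OF_SEQUENCE_STEPS` are hypothetical names and this is not DAQExpert's actual code.

```python
PLACEHOLDER = "{{PROBLEM-SUBSYSTEM}}"

# Steps copied from the out-of-sequence instructions above.
OUT_OF_SEQUENCE_STEPS = [
    "Try to stop/start the run",
    "If this doesn't help: Stop the run. Red & green recycle both the DAQ "
    "and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run. (Try up to 2 times)",
    "Problem fixed: Make an e-log entry. Call the DOC of {{PROBLEM-SUBSYSTEM}} "
    "(subsystem that sent out-of-sync data) to inform about the problem",
    "Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} "
    "(subsystem that sent out-of-sync data)",
]


def render_steps(steps, subsystem):
    """Return the steps with every placeholder replaced by the subsystem name."""
    return [step.replace(PLACEHOLDER, subsystem) for step in steps]


if __name__ == "__main__":
    # Print the numbered steps as a shifter would see them for ECAL.
    for number, step in enumerate(render_steps(OUT_OF_SEQUENCE_STEPS, "ECAL"), start=1):
        print(f"{number}. {step}")
```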

gladky commented 6 years ago

Is this instruction up to date?
Shall we update the instructions for both corrupted-data and out-of-sequence to:

Try to stop/start the run (Red recycle DAQ only)

If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem {{PROBLEM-SUBSYSTEM}}. Start new Run.

Problem fixed: Make an e-log entry. If this happens during physics data taking call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data/out of sequence data) to inform about the problem

Problem not fixed: Call the DOC of {{PROBLEM-SUBSYSTEM}} (subsystem that sent corrupted data/out of sequence data)

Note that:

(Try up to 2 times) was removed
If this happens during physics... was added
I did not add the number of the ECAL DOC, assuming they have it anyway; this avoids having to update it in expert if it changes in the future

andreh12 commented 6 years ago

just to add my two cents on the phone numbers: in the future we could add an (external) configuration file with the map of subsystem to DOC phone numbers to show the phone numbers directly in the message (if people agree, we can open an issue for that but with low priority).
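One way the proposed external configuration could look is sketched below in Python. The JSON layout and the names `load_doc_phones` and `doc_contact` are assumptions made for illustration; the only number taken from this thread is the ECAL DOC one (70130), and nothing here reflects an existing DAQExpert feature.

```python
import json

# Hypothetical content of the external configuration file.
# Only the ECAL entry comes from this thread.
EXAMPLE_CONFIG = '{"ECAL": "70130"}'


def load_doc_phones(text):
    """Parse a JSON object mapping subsystem name to DOC phone number."""
    return {name.upper(): str(number) for name, number in json.loads(text).items()}


def doc_contact(subsystem, phones):
    """Build the 'call the DOC' text, adding the number when it is known."""
    number = phones.get(subsystem.upper())
    if number is None:
        return f"Call the DOC of {subsystem}"
    return f"Call the DOC of {subsystem} ({number})"


if __name__ == "__main__":
    phones = load_doc_phones(EXAMPLE_CONFIG)
    print(doc_contact("ECAL", phones))
```

Falling back to the plain text when a subsystem is missing from the map keeps the message usable even if the configuration file is incomplete.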

gladky commented 6 years ago

I found another special instruction related to ECAL (labeled as 2 in the first comment). It seems to conflict with the 1st one.

Unless "try a run recovery w/o recycling first" == "recycle DAQ only"

gladky commented 6 years ago

Notes from @hsakulin

DAQ subsystem doesn't need to be red-recycled from RunBlocked
DAQ subsystem needs to be red-recycled from Error, which is the case for corrupted-data-received

We should avoid executing the unnecessary recovery step of red-recycling DAQ where possible.
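The rule in these notes can be written down as a small check, sketched here in Python. Treating the states as the plain strings "Error" and "RunBlocked" and the function names are assumptions of this sketch, not DAQExpert's own representation.

```python
# State names taken from the notes above; handling them as plain strings
# is an assumption of this sketch.
STATES_NEEDING_DAQ_RED_RECYCLE = {"Error"}


def daq_needs_red_recycle(daq_state):
    """True only for states from which the DAQ subsystem must be red-recycled."""
    return daq_state in STATES_NEEDING_DAQ_RED_RECYCLE


def first_recovery_step(daq_state):
    """Pick the first step, skipping the DAQ red recycle when it is not needed."""
    if daq_needs_red_recycle(daq_state):
        return "Try to stop/start the run (Red recycle DAQ only)"
    return "Try to stop/start the run"


if __name__ == "__main__":
    for state in ("Error", "RunBlocked"):
        print(state, "->", first_recovery_step(state))
```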

gladky commented 6 years ago

Contacted ECAL, for reference:

Hello Giacomo

ECAL is currently the only subsystem that requires a red-recycle of the DAQ subsystem in the special recovery instructions in case of syncloss problems.

Note that the DAQ subsystem generally does not require a red-recycle from RunBlocked, which is the case in syncloss problems. However, it does require a red-recycle from the Error state, which is the case for corrupted data received problems.

In the shifter bulletin board I found:

If ECAL sends corrupted data to the DAQ (DAQExpert will warn about Corrupted data received) or causes a syncloss, try to recover by stopping the run; red-recycling DAQ only; starting a new run.

Is there a reason why you recommend doing the Red recycle of the DAQ subsystem for syncloss problems?

Additionally, could you please review the special instructions from the bulletin board for ECAL? We've extracted them to a GitHub issue:

https://github.com/cmsdaq/DAQExpert/issues/187

Reply

I think it is related to an issue we have seen while we are testing the new SLinks.

giacomoCucciati commented 6 years ago

As reported also in the email thread, we have new instructions in case of syncloss:

Just stop the run and restart (no GR or RR).
If the issue is still there, a RR of ECAL could help (but in the meantime it is better to call ECAL DOC)

gladky commented 6 years ago

In this case we will have the following instructions.

ECAL corrupted data received

Try to stop/start the run (Red recycle DAQ only)

If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem ECAL. Start new Run.

Problem fixed: Make an e-log entry. If this happens during physics data taking call the DOC of ECAL (subsystem that sent corrupted data) to inform about the problem

Problem not fixed: Call the DOC of ECAL (subsystem that sent corrupted data)

ECAL syncloss

Stop/start the run

If this doesn't help: Stop the run. Red recycle the subsystem ECAL. Start new Run.

In the meantime call ECAL DOC

Problem not fixed: Call the DOC of ECAL

Note that:

@giacomoCucciati's suggestion to call the DOC will appear regardless of time, even outside working hours and physics data taking
perhaps the 3rd step could be made more precise than "in the meantime", e.g. "while executing red recycle call ECAL DOC"

@giacomoCucciati after you confirm the final form I will introduce these changes to the expert system and move these instructions to a new section in the bulletin board, "Covered by DAQExpert"

giacomoCucciati commented 6 years ago

The instructions are ok. Yes, point 3) can be improved, and I would also add this information:

call ECAL DOC during the Red Recycle only if beam is not in RAMP mode

gladky commented 6 years ago

Final, confirmed version of ECAL special instructions:

ECAL corrupted data received

Try to stop/start the run (Red recycle DAQ only)

If this doesn't help: Stop the run. Red & green recycle both the DAQ and the subsystem ECAL. Start new Run.

Problem fixed: Make an e-log entry. If this happens during physics data taking call the DOC of ECAL (subsystem that sent corrupted data) to inform about the problem

Problem not fixed: Call the DOC of ECAL (subsystem that sent corrupted data)

ECAL syncloss

Stop/start the run

If this doesn't help: Stop the run. Red recycle the subsystem ECAL. Start new Run.

Call ECAL DOC during the Red Recycle (only if beam is not in RAMP mode)

Problem not fixed: Call the DOC of ECAL
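For illustration, the confirmed instructions can be encoded as data with the RAMP condition attached to the one step it applies to. This Python sketch is only one possible encoding; `ECAL_INSTRUCTIONS`, `ecal_steps` and the `beam_in_ramp` flag are hypothetical names and do not describe how 2.13.0 implements it.

```python
# Each entry: (step text, show only when the beam is NOT in RAMP mode).
ECAL_INSTRUCTIONS = {
    "corrupted data received": [
        ("Try to stop/start the run (Red recycle DAQ only)", False),
        ("If this doesn't help: Stop the run. Red & green recycle both the DAQ "
         "and the subsystem ECAL. Start new Run.", False),
        ("Problem fixed: Make an e-log entry. If this happens during physics "
         "data taking call the DOC of ECAL (subsystem that sent corrupted data) "
         "to inform about the problem", False),
        ("Problem not fixed: Call the DOC of ECAL "
         "(subsystem that sent corrupted data)", False),
    ],
    "syncloss": [
        ("Stop/start the run", False),
        ("If this doesn't help: Stop the run. Red recycle the subsystem ECAL. "
         "Start new Run.", False),
        ("Call ECAL DOC during the Red Recycle", True),
        ("Problem not fixed: Call the DOC of ECAL", False),
    ],
}


def ecal_steps(problem, beam_in_ramp):
    """Return the steps to show, dropping RAMP-restricted ones during RAMP."""
    return [
        text
        for text, not_in_ramp_only in ECAL_INSTRUCTIONS[problem]
        if not (not_in_ramp_only and beam_in_ramp)
    ]


if __name__ == "__main__":
    for step in ecal_steps("syncloss", beam_in_ramp=False):
        print("-", step)
```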

gladky commented 6 years ago

Included in 2.13.0, moved to section "covered by expert"
