cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Missing SUBSYSTEM in Lengthy fixing-soft-error #222

Closed mommsen closed 6 years ago

mommsen commented 6 years ago

The shifter reported tonight in the elog that the DAQExpert did not report the sub-system which is in lengthy fixing-soft-error:

Lengthy fixing-soft-error 2018-05-23 02:25:18
Level zero in FixingSoftError longer than 30 sec. This is caused by subsystem(s) {{SUBSYSTEM}}
gladky commented 6 years ago

Thank you for reporting. I will investigate it. http://daq-expert.cms/DAQExpert/?start=2018-05-23T00:24:53.281Z&end=2018-05-23T00:25:53.281Z

gladky commented 6 years ago

Here is what happened:

  1. There were 2 periods where L0 was in the fixing soft error state, the 1st lasting 14 sec and the 2nd 21 sec, separated by ~35 msec in the Running state.
  2. The Expert saw this as 1 period of 35 sec where L0 was in the fixing soft error state.
  3. The threshold for firing Lengthy fixing-soft-error was 30 sec. This means that the condition was satisfied at 02:25:18.
  4. The problem was that no subsystem was in the fixing soft error state at this moment (see the attached screenshot of the run info timeline). More specifically, ECAL was back in Running but L0 still indicated fixing soft error. This situation lasted for 6 seconds and included the transition of TCDS from Paused to Running (via TTCHardResetting and Resuming). This is the reason why DAQExpert could not fill in the problematic SUBSYSTEM information in the message.

Why was L0 in fixing soft error even though no subsystem was in this state?

If this is expected, we need to update the logic of the LM to include this assumption.

[Screenshot: run info timeline, 2018-05-23 10:38:56]
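The arithmetic in points 1 to 3 above can be illustrated with a small sketch. It is not DAQExpert code: the class `FixingSoftErrorPeriods`, the `Period` type and the 1 sec gap tolerance are hypothetical, chosen only to show how two periods of 14 sec and 21 sec, separated by a 35 msec gap that is not resolved, add up to a single period that crosses the 30 sec threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the DAQExpert implementation.
class FixingSoftErrorPeriods {

    /** Interval [startMs, endMs) during which L0 reported fixing soft error. */
    static final class Period {
        final long startMs;
        final long endMs;

        Period(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long lengthMs() {
            return endMs - startMs;
        }
    }

    /** Merges consecutive periods (sorted by start) whose gap is at most maxGapMs. */
    static List<Period> merge(List<Period> sorted, long maxGapMs) {
        List<Period> merged = new ArrayList<>();
        for (Period p : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && p.startMs - merged.get(last).endMs <= maxGapMs) {
                // Gap too short to be resolved: extend the previous period.
                merged.set(last, new Period(merged.get(last).startMs, p.endMs));
            } else {
                merged.add(p);
            }
        }
        return merged;
    }

    /** True if any period is longer than the threshold. */
    static boolean exceeds(List<Period> periods, long thresholdMs) {
        for (Period p : periods) {
            if (p.lengthMs() > thresholdMs) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 14 sec, then ~35 msec Running, then 21 sec (offsets are relative).
        List<Period> raw = Arrays.asList(new Period(0, 14_000), new Period(14_035, 35_035));
        long thresholdMs = 30_000;      // threshold from the issue
        long assumedMaxGapMs = 1_000;   // assumption: gaps this short are not seen

        System.out.println(exceeds(raw, thresholdMs));                         // false
        System.out.println(exceeds(merge(raw, assumedMaxGapMs), thresholdMs)); // true
    }
}
```

Taken separately, neither period would have fired the condition; only the merged view exceeds 30 sec.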

hsakulin commented 6 years ago

Soft error recovery always proceeds as follows: Pause TCDS, send fixSoftError to Subsystem(s), resume TCDS.

Assuming that there is always a subsystem in the fixingSoftError state is wrong.

gladky commented 6 years ago

The LM is now collecting the subsystems that were in FixingSoftError during the period where L0 was in FixingSoftError.
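A minimal sketch of this idea, assuming a logic module that is fed one snapshot at a time. The class `LengthyFixingSoftError`, its `update` method and its parameters are hypothetical names for illustration and do not reflect the actual code in commit 1bcef7a; only the message template and the 30 sec threshold are taken from this issue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not the DAQExpert implementation.
class LengthyFixingSoftError {

    static final long THRESHOLD_MS = 30_000;
    static final String TEMPLATE =
            "Level zero in FixingSoftError longer than 30 sec. "
            + "This is caused by subsystem(s) {{SUBSYSTEM}}";

    private long periodStartMs = -1;
    // Subsystems seen in FixingSoftError at any point of the current L0 period.
    private final Set<String> seen = new LinkedHashSet<>();

    /**
     * Processes one snapshot. Returns the filled-in message once L0 has been in
     * FixingSoftError for longer than the threshold, otherwise null.
     */
    String update(long nowMs, boolean l0Fixing, List<String> subsystemsFixing) {
        if (!l0Fixing) {
            // L0 left the state: forget the period and what was collected.
            periodStartMs = -1;
            seen.clear();
            return null;
        }
        if (periodStartMs < 0) {
            periodStartMs = nowMs;
        }
        // Accumulate instead of looking only at the current snapshot.
        seen.addAll(subsystemsFixing);
        if (nowMs - periodStartMs <= THRESHOLD_MS) {
            return null;
        }
        return TEMPLATE.replace("{{SUBSYSTEM}}", String.join(", ", seen));
    }

    public static void main(String[] args) {
        LengthyFixingSoftError lm = new LengthyFixingSoftError();
        lm.update(0, true, Arrays.asList("ECAL"));
        // ECAL is back in Running, TCDS is resuming, L0 still in FixingSoftError.
        lm.update(29_000, true, Collections.<String>emptyList());
        System.out.println(lm.update(31_000, true, Collections.<String>emptyList()));
    }
}
```

With this approach the message names ECAL even though no subsystem is in FixingSoftError at the moment the threshold is crossed, which is the situation described earlier in this thread.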

gladky commented 6 years ago

Fixed with 1bcef7a as 2.10.7