cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

BackpressureFromEventBuilding not firing despite dead time #236

Closed mommsen closed 6 years ago

mommsen commented 6 years ago

Hi,

yesterday evening we had a case where the EvB was causing backpressure on FED 1386: http://daq-expert.cms/daq2view-react/index.html?setup=cdaq&time=2018-08-14-19:53:18

The DAQExpert did not diagnose that the backpressure was coming from the EvB, even though the conditions were satisfied: FEDBuilders with backpressure to FEDs, 0 requests on ru-c2e14-16-01, 256 fragments in the RU, and few (<100) requests on the EVM. All BUs were enabled.

Do you trigger this diagnostic only when there is no rate? IMHO, it should be triggered whenever there is backpressure from DAQ.

Cheers, Remi

gladky commented 6 years ago

Thank you for reporting this case. The condition did not fire because its inputs were oscillating around their thresholds: some were satisfied for the duration of a single snapshot while others were not. In this case we were dealing with an upgraded FED, so we required a TTS deadtime greater than 2%. It was 1.78%.

Backpressure from EvB has the following conditions:

[1] Deadtime due to DAQ has the following conditions:

[2] Upgraded FED problem has the following conditions:

Note that a few minutes later there was a short occurrence of Backpressure from FEROL:

http://daq-expert.cms:8080/DAQExpert/?start=2018-08-14T17:56:19.129Z&end=2018-08-14T17:57:19.129Z

This situation will recur as long as the conditions oscillate around their thresholds. We could think of a solution that prevents conditions from appearing and fading. The first thing that comes to mind is to fire the LM at a given threshold X and keep it satisfied until the value drops below 0.5X. If you have other ideas, please let me know.
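The threshold-with-release idea can be sketched as a small hysteresis check. This is only an illustration, not DAQExpert code: the class name `HysteresisCondition`, the `release_ratio` parameter and the sample values are assumptions made for the example. The 2% threshold and the 0.5X release level come from the comment above.

```python
class HysteresisCondition:
    """Hypothetical sketch: fire above `threshold`, release below `release_ratio * threshold`."""

    def __init__(self, threshold, release_ratio=0.5):
        self.threshold = threshold
        self.release_level = threshold * release_ratio
        self.satisfied = False

    def update(self, value):
        if self.satisfied:
            # Already firing: release only once the value falls below the lower level.
            if value < self.release_level:
                self.satisfied = False
        elif value > self.threshold:
            # Not firing yet: the value must exceed the full threshold.
            self.satisfied = True
        return self.satisfied


# Made-up TTS deadtime samples (percent) oscillating around the 2% threshold.
tts_deadtime = HysteresisCondition(threshold=2.0)
samples = [1.5, 2.3, 1.78, 1.9, 2.1, 0.9, 1.2]
states = [tts_deadtime.update(v) for v in samples]
print(states)
```

With a plain `value > 2.0` check the 1.78 and 1.9 samples would drop the condition again. With the release level at 1.0 it stays satisfied until the 0.9 sample.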

mommsen commented 6 years ago

IMHO, we should drop the requirement on TTSDeadtime if a FED gets backpressured from DAQ. However, I see several instances where TTS is > 2%, e.g. http://daq-expert.cms/daq2view-react/index_fb_dt.html?setup=cdaq&time=2018-08-14-19:46:28
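As a rough illustration of the difference between the two variants, assuming the rule boils down to "FED is backpressured and TTS deadtime exceeds 2%": the function name `upgraded_fed_problem`, the `require_tts_deadtime` flag and the "any backpressure counts" cut are hypothetical, and the 5.3% backpressure value is only illustrative.

```python
TTS_DEADTIME_THRESHOLD = 2.0  # percent, as quoted earlier in this thread


def upgraded_fed_problem(backpressure_pct, tts_deadtime_pct, require_tts_deadtime=True):
    """Hypothetical rule: is an upgraded FED considered problematic?"""
    backpressured = backpressure_pct > 0.0  # assumption: any backpressure from DAQ counts
    if not require_tts_deadtime:
        # Proposed variant: backpressure from DAQ alone is sufficient.
        return backpressured
    # Current variant: TTS deadtime must also exceed the threshold.
    return backpressured and tts_deadtime_pct > TTS_DEADTIME_THRESHOLD


# 1.78% TTS deadtime is the value reported above; 5.3% backpressure is illustrative.
print(upgraded_fed_problem(5.3, 1.78))
print(upgraded_fed_problem(5.3, 1.78, require_tts_deadtime=False))
```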

gladky commented 6 years ago

@mommsen, I identified another factor that prevented this condition from firing in this period. Since this is an upgraded FED, the individual deadtime was not available. The LM needed to verify which FEDs had deadtime and never found the upgraded FEDs. I fixed that.

I ran the new version of DAQExpert on this period and received the following result:

Backpressure from Event Building (i.e. not from HLT). Exists FEDBuilders with backpressure to FEDs 1386 (( last: 5.3%, avg: 6.9%, min: 5.3%, max: 8.5%)) and 0 requests on RU, 256 fragments in RU. EVM has few (( last: 0, avg: 0.5, min: 0, max: 1), the threshold is <100) requests. All BUs are enabled.

Depending on whether we turn off the TTSDeadtime requirement, we get the following number of occurrences:

mommsen commented 6 years ago

Great, thanks for finding this bug! Are these entries only from 19:40-20:00 on Aug 14, or did you find other instances, too?

Let's discuss with Hannes next week whether we should drop the TTSDeadtime condition when there is backpressure from DAQ.

Remi

gladky commented 6 years ago

Only from that evening. For the time being I will prepare a new release to be deployed at the next opportunity. I will leave this issue open so that we can decide with @hsakulin what to do with TTSDeadtime.

mommsen commented 6 years ago

@gladky, when do you plan to release a new version which includes this fix?

gladky commented 6 years ago

@remi I was traveling this weekend and will be back tomorrow. We could schedule it for tomorrow.

Cheers Maciej


gladky commented 6 years ago

Introduced in 2.13.6

gladky commented 6 years ago

Related hotfix in 2.15.1. The last thing to do is to merge entries when they are interrupted by short monitoring fluctuations. I am opening a separate issue for this: #238
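One possible shape for that merging step, as a sketch only: entries are modelled as `(start, end)` pairs in seconds, and the function name `merge_entries`, the `max_gap` parameter and the sample values are assumptions, not the DAQExpert implementation.

```python
def merge_entries(entries, max_gap):
    """Merge (start, end) entries of one condition when the gap between them is at most max_gap."""
    merged = []
    for start, end in sorted(entries):
        if merged and start - merged[-1][1] <= max_gap:
            # Short interruption: extend the previous entry instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up entries: a 2 s interruption is bridged, a 60 s one is not.
entries = [(0, 10), (12, 30), (90, 100)]
result = merge_entries(entries, max_gap=5)
print(result)
```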