DUNE-DAQ / trigger

Trigger infrastructure of the DUNE DAQ
0 stars 6 forks source link

MLT occasionally sends additional TriggerDecisions to the DFO after it completes the 'disable_triggers' transition #250

Closed bieryAtFnal closed 9 months ago

bieryAtFnal commented 11 months ago

This should not happen. The meaning of the 'disable_triggers' command/transition is to stop sending TriggerDecisions.

I suspect that this is caused by the MLT sending out 'pending' TDs at 'stop' time, and my sense is that it should instead do that at 'disable_triggers' time. But, I'll leave the full diagnosis and fix to the experts.

Here are TRACE messages that provide some insight into what is happening...

Below, in reverse time order, are TRACE messages that show the MLT pausing triggers (lines 13552-13554); the DFO completing the “stop” transition (line 13549); the MLT flushing pending TDs, sending a TD to the DFO, and completing the “stop” transition (lines 13451-13456), and the DFO complaining about a TD with the wrong run number at the start of the next run (line 13412).

Recall that run control commands are broadcast to all DAQ applications without any deliberate ordering. So, it is not surprising that the DFO could complete “stop” before the MLT does. That is exactly why the “disable_triggers” command/transition was added (to ensure that the flow of TriggerDecisions are stopped before the DFO is told to stop the run).

13412 11-23 20:17:46.058799 672746 673582 78 DataFlowOrchestrator ERR receive_trigger_decision: DataFlowOrchestrator encountered run number mismatch: recvd (101) != 102 from MLT for trigger_number 462 13418 11-23 20:17:46.058168 672746 673061 120 DataFlowOrchestrator LOG add_callback_impl: Registering callback. 13419 11-23 20:17:46.058004 672746 673061 120 DataFlowOrchestrator LOG add_callback_impl: Registering callback. 13451 11-23 20:17:43.217897 672967 673077 32 ModuleLevelTrigger NFO do_stop: End of run 101 13453 11-23 20:17:43.217178 672967 673077 32 ModuleLevelTrigger LOG do_stop: LivetimeCounter - total deadtime+paused: 2396 13454 11-23 20:17:43.216914 672967 673335 44 ModuleLevelTrigger LOG send_trigger_decisions: Run 101: Received 494 TCs. Sent 462 TDs consisting of 462 TCs. 29 TDs (29 TCs) were created during pause, and 0 TDs (0 TCs) were inhibited. 0 TDs (0 TCs) were dropped. 13455 11-23 20:17:43.211346 672967 673077 32 ModuleLevelTrigger LOG call_tc_decision: Sending a decision with triggernumber 462 timestamp 79554163409484103 number of links 20 based on TC of type 5 13456 11-23 20:17:43.211339 672967 673077 32 ModuleLevelTrigger D03 call_tc_decision: Override?: 1 13457 11-23 20:17:43.211326 672967 673077 32 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163409484103, request window begin: 79554163409484103, request window end: 79554163409484135 13458 11-23 20:17:43.211286 672967 673077 32 ModuleLevelTrigger D03 create_decision: earliest TC index: 0 13459 11-23 20:17:43.210549 672967 673077 32 ModuleLevelTrigger D03 flush_td_vectors: Flushing TDs. Size: 1 13516 11-23 20:17:42.295068 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0 13517 11-23 20:17:42.295065 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163405562183, request window begin: 79554163405562183, request window end: 79554163405562215 13518 11-23 20:17:42.295064 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0 13519 11-23 20:17:42.254728 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0 13520 11-23 20:17:42.254723 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163402746183, request window begin: 79554163402746183, request window end: 79554163402746215 13521 11-23 20:17:42.254705 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0 13537 11-23 20:17:42.204360 672967 673335 44 ModuleLevelTrigger D03 call_tc_decision: Override?: 0 13538 11-23 20:17:42.204357 672967 673335 44 ModuleLevelTrigger D03 create_decision: HSI passthrough: 0, TC detid: 15, TC type: 5, TC cont number: 1, DECISION trigger type: 1, DECISION timestamp: 79554163399930183, request window begin: 79554163399930183, request window end: 79554163399930215 13539 11-23 20:17:42.204342 672967 673335 44 ModuleLevelTrigger D03 create_decision: earliest TC index: 0 13549 11-23 20:17:42.194916 672746 673061 120 DataFlowOrchestrator LOG do_stop: dfo successfully stopped 13552 11-23 20:17:42.071710 672967 673077 99 ModuleLevelTrigger D03 do_pause: TS End: 1700792262071710 13553 11-23 20:17:42.071678 672967 673077 99 ModuleLevelTrigger NFO do_pause: Trigger is paused 13554 11-23 20:17:42.071589 672967 673077 99 ModuleLevelTrigger LOG do_pause: * Triggers PAUSED! in run 101 ***

bieryAtFnal commented 11 months ago

Apologies, I mistakenly called the transition in question 'pause-triggers' when I should have written 'disable_triggers'. I have edited existing text.

bieryAtFnal commented 9 months ago

For reference, I want to mention that Pull Request 249 was originally filed to address this Issue. However, the candidate code changes contained in that PR did not fix the problem. That PR has been closed, but I'm referencing it here because there may be comments in it that are helpful in understanding the underlying problem, should we ever need to go back and remember all of the details.

bieryAtFnal commented 9 months ago

Another comment that I would like to add to this Issue is that we have created an automated integration test that quite reliably demonstrates the problem. This integtest makes use of the configuration suggestions that Michal posted in PR 249, and it is available on the kbiery/integtest_for_the_disable_trigger_transition branch in the trigger repo.

MRiganSUSX commented 9 months ago

Fixed with above mentioned PR (https://github.com/DUNE-DAQ/trigger/pull/262).

bieryAtFnal commented 9 months ago

Closing this Issue now that it has been addressed.