DUNE-DAQ / fdreadoutlibs


Watching for late-arriving TPs in tp_datalinkhandler modules #150

Closed bieryAtFnal closed 7 months ago

bieryAtFnal commented 8 months ago

These changes, along with ones in the readoutlibs repo (PR 147), were added to help users be informed about situations in which TriggerPrimitives arrive at tp_datalinkhandler module instances too late to be sent downstream to the Trigger in TPSets.

This late-arrival behavior was noticed in automated integration tests. More details are available in these slides from the 06-Dec-2023 Core Software meeting.

The TP-arrival-time-monitoring code that is included in these changes does not implement the idea presented in those slides, but rather is based on a suggestion by Wes (and others?) at the meeting. That suggestion was to monitor the timestamps of TPs as they are added to the Latency Buffer, instead of periodically looking through the LB for missed TPs after they are added.
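
As a rough illustration of the "check each TP as it is added" idea described above, here is a minimal Python sketch. It is not the code in this PR (which is C++); the class name `LateTPWatcher`, its fields, and the two callback names are all hypothetical, and it assumes a TP counts as late when its start timestamp falls before the end of the most recent window already sent downstream.

```python
from dataclasses import dataclass


@dataclass
class LateTPWatcher:
    """Hypothetical helper that flags TPs arriving after their window was sent."""

    # End (exclusive) of the latest time window already sent downstream.
    last_sent_boundary: int = 0
    late_tp_count: int = 0
    total_tp_count: int = 0

    def on_tp_added(self, tp_time_start: int) -> bool:
        """Call when a TP is added to the latency buffer; returns True if late."""
        self.total_tp_count += 1
        is_late = tp_time_start < self.last_sent_boundary
        if is_late:
            self.late_tp_count += 1
        return is_late

    def on_tpset_sent(self, window_end: int) -> None:
        """Call after a TPSet covering timestamps below window_end was sent."""
        self.last_sent_boundary = max(self.last_sent_boundary, window_end)


# Usage: one comparison per inserted TP, no periodic scan of the buffer.
watcher = LateTPWatcher()
watcher.on_tp_added(100)    # arrives before any window was sent
watcher.on_tpset_sent(200)  # window [.., 200) goes downstream
watcher.on_tp_added(150)    # belongs to a window that is already gone
print(watcher.late_tp_count, "late of", watcher.total_tp_count)
```

The per-TP cost in this sketch is a single integer comparison and a counter update, which is the kind of low overhead the description aims for.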

These changes have been verified to report warnings in integtests when the number of TP-based triggers drops below the expected level.

As part of these changes, new operational monitoring metrics have been added, and an existing metric has been renamed to better reflect the actual quantity that is being monitored.

It would be appreciated if reviewers would provide feedback on the implementation details of the 'watch TPs as they are added to the LB' model, in addition to all other aspects of the changes. Care was taken to avoid CPU-intensive changes.

As discussed at the Core SW meeting, we should investigate other ways to handle late-arriving TPs at the tp_datalinkhandler, and I will file an Issue for that.

bieryAtFnal commented 8 months ago

To demonstrate the use of these changes, I typically use a locally-modified copy of daqsystemtest/integtest/3ru_1df_multirun_test.py (and/or 3ru_3df_multirun_test.py). The local modification is to reduce the value of the tpset_min_latency_ticks configuration parameter by a factor of 3 or more. A smaller minimum latency value increases the likelihood that the problem will occur.

For example:

swtpg_conf["readout"]["tpset_min_latency_ticks"] = 3375000  # was 9375000
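
To get a feel for what these tick values mean in wall-clock time, the small sketch below converts them to milliseconds. It assumes the timestamps count ticks of a 62.5 MHz clock; that frequency and the helper name `ticks_to_ms` are assumptions for illustration, not something stated in this PR.

```python
# Assumption: timestamps count ticks of a 62.5 MHz clock.
CLOCK_HZ = 62_500_000


def ticks_to_ms(ticks: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a tick count to milliseconds for the given clock frequency."""
    return ticks * 1000 / clock_hz


original_ticks = 9_375_000  # value before the local modification
reduced_ticks = 3_375_000   # value used in the modified integtest

print(ticks_to_ms(original_ticks))  # 150.0
print(ticks_to_ms(reduced_ticks))   # 54.0
```

Under that assumption, the modification shrinks the minimum latency from 150 ms to 54 ms, which leaves much less slack for TPs that arrive late.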

With locally-modified copies of these integtests in a software area, the script that runs multiple integtests can be used to run the tests several times to try to tickle the problem. The problem doesn't happen all of the time, so it sometimes takes several runs before it appears.

For example:

daqsystemtest_integtest_bundle.sh -s 2 -f 2 -l 5 -N 3 --stop-on-fail

bieryAtFnal commented 7 months ago

After updating the names of the metrics that keep track of the various ways that TPs can be lost or dropped, I've re-run my tests, which randomly pretend that there were failures, to check that the values of those metrics add up to the expected totals. I believe that this PR is ready, so I'm going to merge it to develop.
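
The kind of "do the metrics add up" check mentioned above can be sketched as follows. This is only an illustration: the counter names (`received`, `sent`, `dropped_late`, `dropped_error`), the failure probabilities, and both function names are hypothetical and are not the metric names used in fdreadoutlibs.

```python
import random


def run_fake_accounting(n_tps: int, seed: int = 0) -> dict:
    """Simulate n_tps TPs, randomly pretending that some of them fail."""
    rng = random.Random(seed)
    # Hypothetical counters, one per possible fate of a TP.
    counters = {"received": 0, "sent": 0, "dropped_late": 0, "dropped_error": 0}
    for _ in range(n_tps):
        counters["received"] += 1
        roll = rng.random()
        if roll < 0.05:        # pretend the TP arrived too late
            counters["dropped_late"] += 1
        elif roll < 0.08:      # pretend some other failure occurred
            counters["dropped_error"] += 1
        else:
            counters["sent"] += 1
    return counters


def counters_balance(counters: dict) -> bool:
    """True if every received TP is accounted for as sent or dropped."""
    lost = sum(v for k, v in counters.items() if k.startswith("dropped_"))
    return counters["received"] == counters["sent"] + lost


counters = run_fake_accounting(10_000)
print(counters, "balanced:", counters_balance(counters))
```

If any loss path failed to increment its counter, the balance check would return False, which is what makes this style of test useful after renaming or adding metrics.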