DUNE-DAQ / minidaqapp


Split trgemu into trigger and hsi apps #42

Closed philiprodrigues closed 3 years ago

philiprodrigues commented 3 years ago

This PR splits the functionality of the trgemu app (receive TimeSyncs, send TriggerDecisions and receive TriggerDecisionTokens) into separate trigger and hsi apps. The functionality of the hsi app comes from the timinglibs package, and the functionality of the trigger app is from the package of the same name. Eventually the hsi app will be connected to hardware that produces the signals that are turned into triggers, but it currently just generates signals at fixed time intervals.
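Conceptually, the fixed-interval mode of the hsi app is just a periodic emitter. Here is a minimal Python sketch of that idea (the names are invented for illustration; this is not the timinglibs implementation):

```python
import time
from dataclasses import dataclass


@dataclass
class FakeHSIEvent:
    """Stand-in for the signal object the hsi app would emit (hypothetical)."""
    sequence: int
    timestamp_ns: int


def generate_fake_hsi_events(interval_s: float, n_events: int):
    """Yield one fake signal per fixed time interval, mimicking the fake-signal mode."""
    for seq in range(n_events):
        yield FakeHSIEvent(sequence=seq, timestamp_ns=time.time_ns())
        time.sleep(interval_s)


# Example: five fake signals, one every 0.1 s.
for ev in generate_fake_hsi_events(0.1, 5):
    print(ev)
```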

The testing I've done so far is very basic: with two readout units slowed down by a factor of 10, output events are produced in the HDF5 file.

floriangroetschla commented 3 years ago

Tested it at full speed with 10 links and a trigger rate of 10 Hz in the lab, and it seems to work nicely.

bieryAtFnal commented 3 years ago

I ran a few tests, and the results of those looked good.

A few minor questions/comments:

* I needed to set env vars DUNEDAQ_ERS_DEBUG_LEVEL and DUNEDAQ_ERS_VERBOSITY_LEVEL in my environment. This wasn't the case with the earlier mdapp_multiru_gen.py. Is this something that we'll now need to do?
* I noticed that some of the opmon metrics that we were accustomed to from the TDE seem to have gone away. Can those be added to the TriggerApp? (simple things like number of triggers)
* Using the Queue monitoring functionality, I noticed that ~30 TimeSync messages are present in HSI Queues between runs, and ~10 Token messages are present in Trigger Queues between runs (the latter when I have artificial delays in the DataWriter). Those messages appear to get flushed(?) during the next run. I believe that this is totally expected, but I wanted to make note of it, for reference.

Also, I ran with 3 Readout Apps (6 processes overall), and that generally seemed fine. However, when I tried 4 runs in a single DAQ session, the number of TriggerRecords processed decreased with each run. It seemed like incomplete events were waiting for fragments in the FragmentReceiver. Maybe the problem was just that the computer that I was using couldn't handle the 15 links that I was requesting (5 per app, 3 apps). However, I had set the slowdown factor to 10... When I reduced the number of links to 2 per process, with 3 Readout processes, I did not see this behavior, so it seems to be related to the number of threads using CPU.

philiprodrigues commented 3 years ago

Thanks for the thorough testing, Kurt.

* I needed to set env vars DUNEDAQ_ERS_DEBUG_LEVEL and DUNEDAQ_ERS_VERBOSITY_LEVEL in my environment.  This wasn't the case with the earlier mdapp_multiru_gen.py.  Is this something that we'll now need to do?

I've added a commit to remove this requirement. (I actually find it quite useful to be able to control at least the debug level per nanorc run, but let's keep the changes in this PR minimal. We can consider putting this back when we see how https://github.com/DUNE-DAQ/nanorc/issues/8 gets resolved)
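If we do bring back configurable defaults later, the idea would amount to filling in the ERS variables only when the user hasn't set them. A hypothetical Python sketch of that behaviour (not what the commit actually does; names and default values are made up):

```python
import os

# Hypothetical defaults; the actual values and mechanism may differ.
DEFAULT_ERS_SETTINGS = {
    "DUNEDAQ_ERS_DEBUG_LEVEL": "1",
    "DUNEDAQ_ERS_VERBOSITY_LEVEL": "1",
}


def ers_env_with_defaults(env=None):
    """Return a copy of the environment with the ERS variables filled in if unset."""
    merged = dict(os.environ if env is None else env)
    for var, default in DEFAULT_ERS_SETTINGS.items():
        merged.setdefault(var, default)
    return merged


# A user-supplied value always wins over the default.
assert ers_env_with_defaults({"DUNEDAQ_ERS_DEBUG_LEVEL": "5"})["DUNEDAQ_ERS_DEBUG_LEVEL"] == "5"
```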

* I noticed that some of the opmon metrics that we were accustomed to from the TDE seem to have gone away.  Can those be added to the TriggerApp?  (simple things like number of triggers)

@alexbooth92 is working on adding opmon metrics to the trigger modules, and I've opened https://github.com/DUNE-DAQ/trigger/issues/21 as a reminder of this particular instance.

* Using the Queue monitoring functionality, I noticed that ~30 TimeSync messages are present in HSI Queues between runs, and ~10 Token messages are present in Trigger Queues between runs (the latter when I have artificial delays in the DataWriter).  Those messages appear to get flushed(?) during the next run. I believe that this is totally expected, but I wanted to make note of it, for reference.

I just checked, and both queues are flushed at run start (for the token messages, the flushing is implicit: messages are just ignored if the message run number doesn't match the current run number). For the TimeSync messages, this is definitely fine. For the token messages, I'm having trouble working out whether it's OK. The tokens are "outstanding" because MLT receives "stop" before any of the DF modules, but MLT doesn't wait for all in-flight triggers to be completed before stopping. So the order is:

  1. MLT receives "stop" while n triggers are in-flight
  2. MLT stops
  3. DF continues processing the n triggers, sends n tokens, but no one is listening
  4. The tokens are consumed (and correctly ignored) by MLT at the start of the next run

Is there a case where this goes wrong and results in TriggerDecisions not being issued (because of lack of tokens) that should have been? I didn't succeed in inventing such a case, but that doesn't mean it doesn't exist...
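To make the "implicit flush" concrete: the check amounts to comparing each queued token's run number with the current run at run start. A tiny Python sketch of that logic (illustrative names only, not the actual MLT code):

```python
def count_returned_credits(queued_token_runs, current_run_number):
    """Drain tokens left in the queue at run start; only tokens from the current run release a credit."""
    credits = 0
    for token_run in queued_token_runs:
        if token_run == current_run_number:
            credits += 1
        # Tokens left over from a previous run are silently ignored (the implicit flush).
    return credits


# Ten stale tokens from run 7 plus two fresh ones from run 8 yield only two credits.
assert count_returned_credits([7] * 10 + [8, 8], current_run_number=8) == 2
```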

> Also, I ran with 3 Readout Apps (6 processes overall), and that generally seemed fine. However, when I tried 4 runs in a single DAQ session, the number of TriggerRecords processed decreased with each run. It seemed like incomplete events were waiting for fragments in the FragmentReceiver. Maybe the problem was just that the computer that I was using couldn't handle the 15 links that I was requesting (5 per app, 3 apps). However, I had set the slowdown factor to 10... When I reduced the number of links to 2 per process, with 3 Readout processes, I did not see this behavior, so it seems to be related to the number of threads using CPU.

I don't have any good ideas about what could be causing this, but it sounds like something that needs further investigation.