DUNE-DAQ / minidaqapp

Split out the DQM functionality in a separate process from readout #98

bieryAtFnal closed this issue 2 years ago

bieryAtFnal commented 2 years ago

For reference, here are the changes that I made today on this branch (which, of course, built on the changes that Juan had already made):

  1. This was a bit of a hack, but I changed the NetworkEndpoints entries for frags_dqm_{HOSTIDX} to use fragx_dqm_{HOSTIDX} (frag"s" changed to frag"x"). This was a quick way to get around the fact that the dataflow_gen.py script has a loop over all network endpoints that have "frags" in their name and creates a NetworkToQueue module for each of them. That logic is fine for the Dataflow app to receive Fragments from Readout (frags_{hostidx} and tpfrags_{hostidx}) and Trigger (frags_tpset_ds_{idx}), but it causes the dataflow_gen.py script to create unwanted NetworkToQueue modules for Fragments which are supposed to go from the DataLinkHandlers to the DQM_TRB instances. (A sketch of this kind of name filter follows this list.)
    • of course, the right way to fix this would be to modify the logic in the dataflow_gen.py script, but I wanted to minimize what I changed.
    • I believe that this was the cause of the problem that Juan mentioned when I started my tests. He saw that some, but not all, of the Fragments were failing to make it from the DLHs to the DQM_TRB. My sense is that this was because some of them were being sent to the NetworkToQueue module in the Dataflow app instead.
  2. I added if enable_dqm conditions around some of the DQM-related sections of mdapp_multiru_gen.py. These were needed because I saw errors when I ran the configuration generation without the --enable-dqm command-line option. (A sketch of this kind of guard follows this list.)
  3. I changed the handling of the datareq_dqm_{idx} network endpoints. This included moving where they are defined (in mdapp_multiru_gen.py) and changing their contents.
    • for one thing, they were using host_df when they should be using host_ru (both the DQM process and the Readout process are running on the same host, but that host may be different from the DF host). We probably didn't see any problem caused by this in our development testing because we were using the same host for RU and DF.
    • the other change that I made here may not have been needed (I now realize). This change was to create NetworkEndpoints for datareq_dqm_{RUHOSTIDX}_{DataProducerIdx} instead of datareq_dqm_{TotalDataProducerIdx}. I'll run some tests once things quiet down and, if that is appropriate, switch things back to the way that they were as a starting point for post-v2.8.2 development. [Thinking about the indexing now, it shouldn't matter whether the network endpoints include
      • [datareq_dqm_0 .. datareq_dqm_14]
      • or
      • [datareq_dqm_0_0 .. datareq_dqm_0_4, datareq_dqm_1_0 .. datareq_dqm_1_4, datareq_dqm_2_0 .. datareq_dqm_2_4]
      • as long as everything is consistent. A sketch comparing the two schemes follows this list.]
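
To make the name-matching issue in 1 concrete, here is a minimal Python sketch of the kind of filter being described. The function name, the example endpoint names and ports, and the dictionary layout are illustrative assumptions, not the actual dataflow_gen.py code.

```python
# Hypothetical sketch (not the real dataflow_gen.py logic): any network
# endpoint whose name contains "frags" gets a NetworkToQueue module in the
# Dataflow app, which is why the DQM fragment endpoint had to stop matching.

from typing import Dict, List

def pick_fragment_receivers(network_endpoints: Dict[str, str]) -> List[str]:
    """Return the endpoint names that the Dataflow app would wrap in NetworkToQueue."""
    return [name for name in network_endpoints if "frags" in name]

endpoints = {
    "frags_0": "tcp://{host_ru0}:12345",      # Readout fragments -> Dataflow (wanted)
    "tpfrags_0": "tcp://{host_ru0}:12346",    # TP fragments -> Dataflow (wanted)
    "frags_dqm_0": "tcp://{host_ru0}:12347",  # DLH -> DQM_TRB fragments (not wanted here)
    "fragx_dqm_0": "tcp://{host_ru0}:12348",  # renamed endpoint: no longer matches "frags"
}

print(pick_fragment_receivers(endpoints))
# ['frags_0', 'tpfrags_0', 'frags_dqm_0']
# Before the rename, the DQM endpoint would also get an unwanted NetworkToQueue
# module in the Dataflow app; the renamed 'fragx_dqm_0' is left alone.
```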
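
Similarly, a minimal sketch of the kind of guard added in 2, assuming only that mdapp_multiru_gen.py has a boolean reflecting the --enable-dqm option; the function and endpoint details here are hypothetical.

```python
# Sketch: DQM-related endpoints are only generated when DQM is enabled, so a
# configuration generated without --enable-dqm never references DQM pieces.

def make_network_endpoints(number_of_ru_hosts: int, enable_dqm: bool) -> dict:
    endpoints = {}
    for host_idx in range(number_of_ru_hosts):
        endpoints[f"frags_{host_idx}"] = f"tcp://{{host_ru{host_idx}}}:12345"
        if enable_dqm:
            # define the DQM fragment endpoint only when --enable-dqm was given
            endpoints[f"fragx_dqm_{host_idx}"] = f"tcp://{{host_ru{host_idx}}}:12347"
    return endpoints
```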
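
Finally, a small sketch comparing the two indexing schemes from 3. The helper names and the per-RU link counts are assumptions; the point is only that both schemes give each link a unique endpoint name, so either works as long as the sending and receiving sides agree.

```python
# Two ways to name the DQM data-request endpoints: one flat index over all
# data producers, or a per-RU-host index plus a per-producer index.

def flat_names(links_per_ru: list) -> list:
    total = sum(links_per_ru)
    return [f"datareq_dqm_{idx}" for idx in range(total)]

def per_host_names(links_per_ru: list) -> list:
    return [
        f"datareq_dqm_{ru_idx}_{link_idx}"
        for ru_idx, n_links in enumerate(links_per_ru)
        for link_idx in range(n_links)
    ]

# Example: 3 readout hosts with 5 links each, matching the lists above.
print(flat_names([5, 5, 5])[:3])      # ['datareq_dqm_0', 'datareq_dqm_1', 'datareq_dqm_2']
print(per_host_names([5, 5, 5])[:3])  # ['datareq_dqm_0_0', 'datareq_dqm_0_1', 'datareq_dqm_0_2']
```
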
jmcarcell commented 2 years ago

Wow, I didn't understand why you would bother changing frags to fragx in 1, but now it makes sense. I don't know if I would ever have caught that one, and I agree with you that this is likely why I was seeing that not all the fragments were getting to the DQM_TRB.

On 3, I agree that it doesn't matter whether we use the new naming or the old one, since each endpoint gets a unique name either way. I see that you left the new one, so I'll try to remember that in case we want to make other changes to this part in the future.