DUNE-DAQ / minidaqapp

TRBuilder configuration #36

Closed mroda88 closed 3 years ago

mroda88 commented 3 years ago

Here is the configuration change needed to use the TRBuilder. At the moment, this overrides the old configuration. We may need to decide whether to keep supporting both the separate and the unified modules, or only the unified one.

Here is the list of things ready

Tests for single process:

Tests for 2 processes:

Tests for 3 processes:

mroda88 commented 3 years ago

In the one process configuration, with a lot of links, at the stop transition, I see

2021-Apr-27 15:37:43,295 WARNING [dunedaq::readout::ReadoutModel<RawType>::run_timesync(...) at /user/mroda/MyStore/Software/DAQ/TRBuilder/sourcecode/readout/src/ReadoutModel.hpp:242]  Failed attempt to write to the queue: timesync message queue. Data will be lost!
    was caused by: 2021-Apr-27 15:37:43,295 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_2/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_q: Unable to push within timeout period (timeout period was 0 milliseconds)

I tried changing the order in which the stop command is distributed among the modules, but the problem is still there. I suspect we need a finer-grained stop transition than what is implemented now in the TRBuilder, but I don't have a plan yet. I'm not even sure whether the solution lies in a different implementation or somewhere else. I suspect this will require some discussion.

eflumerf commented 3 years ago

The changes to the python files look reasonable. I think the plan is to move forward with the combined module, so I don't think we need to retain backwards-compatibility code.

bieryAtFnal commented 3 years ago

I found, and hopefully fixed, a minor issue with the "trb" module name in nanorc/dataflow_gen.py (RequestGenerator needed to be changed to TriggerRecordBuilder). Before this change, I couldn't get a 3-process system (one RU) to work, and after the change, I could get that to work.

Marco, please check that I did this correctly, when you get a chance.

mroda88 commented 3 years ago

Hi @bieryAtFnal, yes, your changes are OK. Since I haven't tried the multi-process configuration yet, I only checked that the Python syntax was correct. Thanks.

bieryAtFnal commented 3 years ago

Hi Marco, yes, I recall that we discussed the need to revisit TimeSync messages at some point. Maybe some of those issues will go away with the TimingApp (and the associated FakeTimingApp, if that is still in the plan), when it becomes available. We can/should talk about this, as you suggest.

In somewhat related news, I noticed that TriggerRecords at the end of a run can occasionally have missing Fragments. From the tests that I've done so far, I suspect that this issue pre-dated your TRBuilder changes, but that remains to be confirmed.

The test in which I noticed this was in a one-process system that had 10 links and no slowdown factor (factor == 1). The computer on which I'm running the test may not be keeping up, and that may be contributing to the problem.

bieryAtFnal commented 3 years ago

With a change to the order in which the Stop commands are sent to the processes, I haven't seen missing Fragments at the end of a run (either with or without the TRBuilder change).
So, my sense is that the TRBuilder changes look good so far.
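
The thread does not record which Stop order was adopted, so the following is only a minimal sketch of why the order can matter, under the assumption that data still in flight when a downstream module stops is lost. The `Stage` class, the `end_of_run` helper, and the two-stage producer/consumer chain are hypothetical and are not taken from minidaqapp or nanorc.

```python
from collections import deque


class Stage:
    """Hypothetical module: drains its input queue when it is stopped."""

    def __init__(self, name, in_q=None, out_q=None):
        self.name = name
        self.in_q = in_q
        self.out_q = out_q
        self.running = True
        self.handled = []

    def push(self, item):
        # Only a running stage with an output queue can emit data.
        if self.running and self.out_q is not None:
            self.out_q.append(item)

    def stop(self):
        # Drain whatever is already queued, then stop.
        while self.in_q:
            self.handled.append(self.in_q.popleft())
        self.running = False


def end_of_run(stop_order):
    """Return how many items are left behind in the queue after all stops."""
    q = deque()
    producer = Stage("producer", out_q=q)
    consumer = Stage("consumer", in_q=q)
    stages = {"producer": producer, "consumer": consumer}

    for i in range(3):
        producer.push(i)

    for name in stop_order:
        stages[name].stop()
        # A late item arrives between stop commands; it is ignored
        # once the producer itself has been stopped.
        producer.push("late")

    return len(q)


if __name__ == "__main__":
    print("producer first:", end_of_run(["producer", "consumer"]))
    print("consumer first:", end_of_run(["consumer", "producer"]))
```

In this toy model, stopping the upstream stage first leaves nothing behind, while stopping the downstream stage first strands the late item in the queue. Whether the real system behaves the same way depends on how each module handles its stop transition.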

mroda88 commented 3 years ago

When we have a lot of links, I sometimes see things like

2021-May-05 17:34:21,751 WARNING [void dunedaq::readout::FakeCardReader::generate_data(dunedaq::appfwk::DAQSink<dunedaq::readout::types::WIB_SUPERCHUNK_STRUCT>*, int) at /nfs/home/maroda/DAQ/TRBuilder/sourcecode/readout/plugins/FakeCardReader.cpp:243]  Failed attempt to write to the queue: raw data input queue. Data will be lost! -- 10 similar messages suppressed, last occurrence was at 2021-May-05 17:34:21,751267
    was caused by: 2021-May-05 17:34:21,751 ERROR [void dunedaq::appfwk::FollyQueue<T, FollyQueueType>::push(dunedaq::appfwk::FollyQueue<T, FollyQueueType>::value_t&&, const duration_t&) [with T = dunedaq::readout::types::WIB_SUPERCHUNK_STRUCT; FollyQueueType = folly::DSPSCQueue; dunedaq::appfwk::FollyQueue<T, FollyQueueType>::value_t = dunedaq::readout::types::WIB_SUPERCHUNK_STRUCT; dunedaq::appfwk::FollyQueue<T, FollyQueueType>::duration_t = std::chrono::duration<long int, std::ratio<1, 1000> >] at /nfs/home/maroda/DAQ/TRBuilder/sourcecode/appfwk/include/appfwk/FollyQueue.hpp:60] wib_link_11: Unable to push within timeout period (timeout period was 0 milliseconds)

But this is on the readout side and is not surprising, since the timeout is 0. Apart from that, all the tests ran fine on np04-srv-028.
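
To illustrate why a zero timeout makes this warning unsurprising, here is a minimal sketch using Python's standard `queue.Queue` as a stand-in for a bounded queue. It is not the appfwk `FollyQueue` implementation; `push_with_timeout` and `demo` are hypothetical helpers, and the queue size and delays are arbitrary.

```python
import queue
import threading
import time


def push_with_timeout(q, item, timeout_s):
    """Try to push; return True on success, False if the queue stayed full."""
    try:
        if timeout_s <= 0:
            # Zero timeout: fail immediately if there is no free slot.
            q.put_nowait(item)
        else:
            # Positive timeout: wait up to timeout_s for a free slot.
            q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def demo():
    q = queue.Queue(maxsize=2)

    # Fill the queue while no consumer is reading from it.
    assert push_with_timeout(q, "chunk-0", 0)
    assert push_with_timeout(q, "chunk-1", 0)

    # The queue is full, so a zero-timeout push is dropped right away.
    dropped = not push_with_timeout(q, "chunk-2", 0)

    # A slow consumer frees one slot after a short delay.
    def consumer():
        time.sleep(0.05)
        q.get()

    t = threading.Thread(target=consumer)
    t.start()
    # With a non-zero timeout, the push can wait for that slot.
    delivered = push_with_timeout(q, "chunk-3", 1.0)
    t.join()
    return dropped, delivered


if __name__ == "__main__":
    print(demo())
```

With a zero timeout, any moment in which the consumer lags behind the producer turns into a dropped item, which is consistent with the message appearing mainly when many links are active.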