DUNE-DAQ / trigemu

Trigger decision emulator for January 2020 readout application tests

trgemu process seems to crash when given start command after ruemu-df process #22

Open bieryAtFnal opened 3 years ago

bieryAtFnal commented 3 years ago

I ran a test with a two-process system and each of the processes was on a separate host.

Here is the confgen command that I used (I'm using two computers in the mu2edaq pilot cluster at Fermilab):

    python -m minidaqapp.rudf_trg -n10 -s 10 -t 5 --host-ip-df 192.168.157.13 --host-ip-trigemu 192.168.157.14 -d /scratch/biery/data/frames.bin -o "." minidaq_2proc_noFelix

I used the stdin command handler to start the processes:

    daq_application --name mdapp-df -c stdin://./minidaq_2proc_noFelix-ruemu_df.json
    daq_application --name mdapp-trigemu -c stdin://./minidaq_2proc_noFelix-trgemu.json

In the test, I specify the init and conf commands to the ruemu-df process, then do the same for the trgemu process. Then, I give the start command to the ruemu-df process, wait ~10 seconds, and then give the start command to the trgemu process.

At that point, I see a crash of the trgemu process with a trace that looks like the following:

2021-Mar-12 13:50:20,919 ERROR [static void ers::ErrorHandler::terminate_handler() at /scratch/workdir/sourcecode/ers/src/ErrorHandler.cpp:135] Unhandled 'appfwk::QueueTimeoutExpired' exception has been thrown
    Parameters = 'name=appfwk::QueueTimeoutExpired' 
    Qualifiers = 'unknown' 
    host = mu2edaq14
    user = biery (2310)
    process id = 234860
    thread id = 234920
    process wd = /home/biery/dunedaq/12MarB
    stack trace of the crashing thread:
      #0  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/ers/v1_1_1/slf7.x86_64.e19.prof/ers/lib64/libers.so(+0x11e6f) [0x7f464faede6f]
      #1  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91db6) [0x7f464f7e9db6]
      #2  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x91df1) [0x7f464f7e9df1]
      #3  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0x92024) [0x7f464f7ea024]
      #4  /home/biery/dunedaq/12MarB/build/trigemu/plugins/libtrigemu_TriggerDecisionEmulator_duneDAQModule.so(dunedaq::appfwk::FollyQueue<dunedaq::dfmessages::TimeSync, folly::DSPSCQueue>::push(dunedaq::dfmessages::TimeSync&&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&)+0x280) [0x7f463e3c6000]
      #5  /home/biery/dunedaq/12MarB/build/dfmessages/plugins/libdfmessages_TimeSyncNQ_duneNetworkQueue.so(dunedaq::nwqueueadapters::NetworkToQueueImpl<dunedaq::dfmessages::TimeSync>::push()+0x82) [0x7f461b3b2d32]
      #6  /home/biery/dunedaq/12MarB/build/nwqueueadapters/plugins/libnwqueueadapters_NetworkToQueue_duneDAQModule.so(dunedaq::nwqueueadapters::NetworkToQueue::do_work(std::atomic<bool>&)+0x2c) [0x7f463e622bcc]
      #7  /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/gcc/v8_2_0/Linux64bit+3.10-2.17/lib64/libstdc++.so.6(+0xbb02f) [0x7f464f81302f]
      #8  /lib64/libpthread.so.0(+0x7ea5) [0x7f46501acea5]
      #9  /lib64/libc.so.6(clone+0x6d) [0x7f464f2708dd]
    was caused by: 2021-Mar-12 13:50:20,918 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)
Signal received: 6
Aborted (core dumped)

If I give the start command to the trgemu process first, then I can get beyond this issue.

glehmannmiotto commented 3 years ago

It looks as if the df application starts generating time syncs, which the network layer on the other side receives and buffers. When the trigemu starts, the network module receives all of the buffered time syncs and tries to push them onto the queue. I see three issues:

1) the queues are probably very short
2) there is no try/catch around the push method in the nw adapter module
3) the expected behaviour for messages sent before an application is ready to take them still needs to be discussed; is the present behaviour OK? Should all messages available/queued for reception be dropped at start?


bieryAtFnal commented 3 years ago

Regarding point 1... Yes, if I increase the capacity of the time_sync_from_netq Queue in the *-trgemu.json file from 100 to 10000, that seems to help...

alessandrothea commented 3 years ago

Regarding point 3), one must make a distinction between data messages and status messages. The latter are not supposed to be queued, by definition, since only the last message is relevant. I remember we discussed this at some point. When does trigemu start draining the timesync queues? At configure or at start?

alessandrothea commented 3 years ago

It looks like the TimestampEstimator object only exists between start and stop, which explains why timesyncs are piling up. My suggestion is to create it at conf and reset it at scrap. It would also be good to publish the current estimated timestamp with the monitoring information.

bieryAtFnal commented 3 years ago

The idea of somehow keeping only the most recent status message sounds good to me, but I don't know if we have examples of how to do that (with either Queues or Queues+NWQueueAdapters)...

philiprodrigues commented 3 years ago

I agree with all the analysis in the comments above. Not catching the push timeout exception in nwqueueadapters is definitely a bug, fixed in this PR, ready for testing:

https://github.com/DUNE-DAQ/nwqueueadapters/pull/8

Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

I don't think we have any way in the framework to handle "status" message streams for which only the most recent message is relevant, but at the application-logic level, we get somewhere near it: in TimestampEstimator, all the TimeSyncs on the input queue are popped, and the largest-timestamp one is used for the estimation.

alessandrothea commented 3 years ago

> Moving the timestamp estimation to "conf until scrap" would be ideal, but readout resets the timestamp at each new run start, which confuses the timestamp estimation logic. If we really need the estimation to work for "conf until scrap", we can look into ways to make the timestamp estimation logic more sophisticated, or look into changing readout to not reset timestamps at run start.

My take is that readout should not reset the timing estimation at start...

bieryAtFnal commented 3 years ago

With the updated code in nwqueueadapters, I see a burst of messages like the following when the ruemu-df process is given the "start" command:

2021-Mar-15 08:52:35,555 WARNING [dunedaq::nwqueueadapters::NetworkToQueue::do_work(...) at /home/biery/dunedaq/14Mar/sourcecode/nwqueueadapters/plugins/NetworkToQueue.cpp:70] Push timed out: Message of type dunedaq::dfmessages::TimeSync to queue time_sync_from_netq
    was caused by: 2021-Mar-15 08:52:35,555 ERROR [FollyQueueType>::push(...) at /cvmfs/dune.opensciencegrid.org/dunedaq/DUNE/products/appfwk/v2_2_0/slf7.x86_64.e19.prof/appfwk/include/appfwk/FollyQueue.hpp:54] time_sync_from_netq: Unable to push within timeout period (timeout period was 10 milliseconds)

Is this OK/expected? It seems like it might be nicer to have no warnings.

The command sequence that I'm using in my tests is:

bieryAtFnal commented 3 years ago

Ignoring the warning messages, things seem to work. I can start the trgemu process, resume triggers, and see them written to disk. And, it seems like I can stop the first run and start a second one.

I don't know if this is related to Alessandro's point, but looking at the timestamps in the data files from the two runs, they go back to "2020-07-19 13:05:02" at the start of the second run, but that may just be a feature of the FakeCardReader.

philiprodrigues commented 3 years ago

The warnings are expected: they occur when the NetworkToQueue module can't push onto its output queue because the queue is full. So the code in NetworkToQueue that produces the warning is correct, but I agree it'll be nicer to not have the situation arise in the first place. That will require the changes to TimestampEstimator and readout that Alessandro suggested.

The FakeCardReader reads the first timestamp from its input data file and adds 25 ticks to each subsequent frame. The input data file was taken on 2020-07-19, which is why you see that date in the output file.

philiprodrigues commented 3 years ago

Roland and I took a look at readout and it looks like it'll be nontrivial to change readout to have the desired behaviour. I've filed an issue in readout so we don't forget. For the current release, I think we will have to live with the TDE/readout behaviour as it is. We could make it not appear for end users by just making all of the queues much larger (but of course, for testing, small queues are great for helping surface issues like this one).

glehmannmiotto commented 3 years ago

Is it correct that at the present moment we can start the trigger before the rudf app, since triggers actually start flowing only at resume? That would be a fully valid solution, no? It would be like enforcing an order on the modules.


bieryAtFnal commented 3 years ago

Yes, starting the trgemu process before the rudf process works fine, and that is a completely reasonable solution. Of course, we somehow need to document and/or enforce that ordering of commands so that others know about it, if that becomes the "solution".