The number of EB buffers seems to affect whether a demo system continues to take data after an EB crash

eflumerf commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/23050 (FNAL account required) Originally created by @bieryAtFnal on 2019-08-02 18:28:04

The magic number seems to be 10. Systems with more than 10 buffers per EB run into problems when an EB dies, while systems with fewer than 10 buffers continue running.

To reproduce this issue:

- wget https://cdcvs.fnal.gov/redmine/projects/artdaq-demo/repository/revisions/develop/raw/tools/quick-mrb-start.sh
- chmod +x quick-mrb-start.sh
- ./quick-mrb-start.sh --tag=v3_06_00
- modify the DAQInterface settings to use direct process control
  - add a line like the following to your /DAQInterface/user_sourcefile_example: export DAQINTERFACE_PROCESS_MANAGEMENT_METHOD="direct"
- tell DAQInterface to continue running after an EB has crashed
  - make a copy of /DAQInterface/process_requirements_list_example and edit the EventBuilder line so that A) it is not commented out, and B) the floating point number on its line is 0.5
  - add a line like the following to your /DAQInterface/user_sourcefile_example: export DAQINTERFACE_PROCESS_REQUIREMENTS_LIST=${yourArtdaqInstallationDir}/DAQInterface/
- you may also need to edit /DAQInterface/settings_example to set a better value for the productsdir_for_bash_scripts parameter

Then use a command like the following one to start a run:

sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile $(pwd)/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 300 --partition 0 --no_om

After the run gets going, look in /daqdata to see that the data file is growing over time.

Kill one of the event builders and check if data is still flowing through the remaining event builders (for example, the data file is still growing).
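The "is the data file still growing" check can be automated. Below is a minimal Python sketch, not part of artdaq or DAQInterface: the helper names `newest_file` and `is_growing` are made up for illustration, and the sampling interval is an arbitrary choice. It samples the size of a file a few times and reports whether the size increased.

```python
import os
import time


def newest_file(directory):
    """Return the most recently modified regular file in directory, or None."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else None


def is_growing(path, samples=3, interval_s=5.0, sleep=time.sleep):
    """Sample the file size `samples` times, pausing between samples.

    Returns True if the last sampled size is larger than the first.
    `sleep` is injectable so the check can be exercised without waiting.
    """
    sizes = []
    for i in range(samples):
        sizes.append(os.path.getsize(path))
        if i < samples - 1:
            sleep(interval_s)
    return sizes[-1] > sizes[0]


if __name__ == "__main__":
    # /daqdata is the output directory mentioned in this issue.
    data_file = newest_file("/daqdata")
    if data_file is None:
        print("no data file found")
    else:
        print(data_file, "growing" if is_growing(data_file) else "NOT growing")
```

Running it once before and once after killing an EB gives a quick yes/no answer, assuming the newest file in /daqdata is the one the run is writing.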

With 5 EBs, this clearly showed a difference between 9 and 11 buffers per EB. With 3 EBs, such as what we get with the default mediumsystem_with_routing_master config, I'm not seeing a problem with 20 buffers. So, it needs more investigation.

----

***Related issues:***
- https://github.com/art-daq/artdaq/issues/145

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:38:14

I believe that this is related to running with Autodetect or Shmem transfer mode.

Shmem_transfer takes its default buffer count from the TransferInterface base class, and that default is 10.

So, in a mediumsystem_with_routing_master test run with EB buffers set to 20, component01 believes that 20 buffers are available in each of the EBs, including the one that was killed, and it dutifully tries to send 20 events. The dead EB may have had 20 buffers, but the Shmem_transfer that the BoardReader uses to interface with it has only 10. With no reader left to drain them, those 10 fill up, and the attempt to send the 11th event after the crash hangs forever.

It seems like Shmem_transfer either needs to know about the health of the reading process or maybe have a timeout on the write.
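To make the failure mode and the "timeout on the write" idea concrete, here is a toy Python model. It is only a sketch: it uses a bounded `queue.Queue` as a stand-in for the shared-memory buffers and does not use any artdaq API. The function name `send_events` and its parameters are hypothetical. The queue is never drained, which models a dead EB, and the write timeout turns the indefinite hang into a detectable failure.

```python
import queue


def send_events(tokens, shmem_buffers=10, timeout_s=0.05):
    """Try to send `tokens` events into a bounded buffer that nobody reads.

    tokens        -- number of events the sender believes the receiver can take
    shmem_buffers -- capacity of the transfer (10 is the default cited above)
    timeout_s     -- how long a single write may block before giving up

    Returns (events_sent, timed_out).
    """
    shmem = queue.Queue(maxsize=shmem_buffers)  # stand-in for the shmem buffers
    sent = 0
    for event in range(tokens):
        try:
            # Without the timeout this call would block forever once the
            # queue is full, which mirrors the hang described above.
            shmem.put(event, timeout=timeout_s)
        except queue.Full:
            return sent, True
        sent += 1
    return sent, False


if __name__ == "__main__":
    for tokens in (9, 10, 11, 20):
        sent, timed_out = send_events(tokens)
        print(f"tokens={tokens}: sent={sent}, timed_out={timed_out}")
```

In this model the sender gets through at most `shmem_buffers` events. Any token count above that ends in a timeout rather than a hang, which would give the sender a chance to stop routing data to the dead process.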

This issue may be related to a different one (that I've been trying to document for some time) in which an EventBuilder that is receiving fragments via the Shmem_transfer can't be restarted. I haven't filed that Issue yet, but I will at some point. In the meantime, there are hints of that problem in the tests that I describe on 01-Aug in Redmine Issue 21621.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:42:53

Another bit of trivia is that we can't seem to set the number of buffers used by the Shmem_transfer. Or rather, it isn't clear how to do that. TransferInterface accepts a parameter named "buffers". But the contents of the destination block in our configuration FCL files are completely book-kept, and it's not clear how to tell DAQInterface what buffer count to use when it generates all of the destination config entries.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:54:55

Of course, there is an element of chance in provoking the problem. The crucial number is the number of buffers in the routing table when the EB dies (not the configured number of buffers in the EB). If that is larger than 10, then the system will run into trouble. But if it's smaller than or equal to 10, then the system will keep running.
