Closed by eflumerf 2 years ago
Comment by @bieryAtFnal on 2019-08-07 21:38:14
I believe that this is related to running with Autodetect or Shmem transfer mode.
Shmem_transfer gets its default value of the number of buffers to use from the TransferInterface base class. And, that default value is 10.
So, in a mediumsystem_with_routing_master test run with EB buffers set to 20, component01 thinks that there are 20 buffers available in each of the EBs, including one that has been killed, and it dutifully tries to send 20 events. The EB itself may have had 20 buffers, but the Shmem_transfer that the BoardReader uses to interface with it has only 10. So, once the EB has died and nothing is draining the shared memory, the attempt to send the 11th event hangs forever.
It seems like Shmem_transfer either needs to know about the health of the reading process or maybe have a timeout on the write.
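The mismatch described above, and the write-timeout idea, can be illustrated with a small stand-alone sketch. This is not artdaq code: a bounded `queue.Queue` stands in for the shared-memory transfer, and `send_with_timeout` and `simulate_dead_reader` are hypothetical names invented for the illustration. The only numbers taken from the comment are the 10 transfer buffers and the 20 buffers the sender believes are available.

```python
import queue

TRANSFER_BUFFERS = 10  # default buffer count attributed to TransferInterface above
ROUTED_EVENTS = 20     # what the sender believes the (dead) EB can still accept


def send_with_timeout(transfer, event, timeout_s):
    """Try to write one event; return False on timeout instead of blocking forever."""
    try:
        transfer.put(event, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def simulate_dead_reader(n_events=ROUTED_EVENTS, n_buffers=TRANSFER_BUFFERS,
                         timeout_s=0.05):
    """Send n_events into a transfer that nobody reads.

    Returns the 1-based index of the first send that fails, or None if
    every send fits into the available buffers.
    """
    transfer = queue.Queue(maxsize=n_buffers)  # stand-in for the shared memory
    for i in range(1, n_events + 1):
        if not send_with_timeout(transfer, i, timeout_s):
            return i
    return None


if __name__ == "__main__":
    print("first failed send:", simulate_dead_reader())
```

With no timeout (a plain blocking `put`), the 11th send in this model would never return, which mirrors the hang described above. With a timeout, the sender at least gets a chance to notice the failure and react.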
This issue may be related to a different one (that I've been trying to document for some time) in which an EventBuilder that is receiving fragments via the Shmem_transfer can't be restarted. I haven't filed that Issue yet, and I will at some point. In the meantime, there are hints of that problem in the tests that I describe on 01-Aug in Redmine Issue 21621.
Comment by @bieryAtFnal on 2019-08-07 21:42:53
Another bit of trivia is that we can't seem to set the number of buffers used by the Shmem_transfer. Or rather, it isn't clear how to do that. TransferInterface accepts a parameter named "buffers". But the contents of the destination blocks in our configuration FCL files are entirely book-kept, and it's not clear how to tell DAQInterface what buffer count to use when it generates all of the destination config entries.
Comment by @bieryAtFnal on 2019-08-07 21:54:55
Of course, there is an element of chance in provoking the problem. The crucial number is the number of buffers in the routing table when the EB dies (not the configured number of buffers in the EB). If that number is larger than 10, the system will run into trouble. If it is smaller than or equal to 10, the system will keep running.
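The condition in this comment reduces to a single comparison. The sketch below only restates it; `sender_will_hang` is a hypothetical helper, not part of artdaq, and the default of 10 is the Shmem_transfer buffer count quoted earlier in the thread.

```python
def sender_will_hang(routing_table_buffers_at_death, transfer_buffers=10):
    """Return True if the sender is expected to block after the EB dies.

    routing_table_buffers_at_death: number of buffers the routing table still
    lists for the EB at the moment it dies (not the configured EB buffer count).
    transfer_buffers: buffers actually available in the transfer to that EB.
    """
    return routing_table_buffers_at_death > transfer_buffers


if __name__ == "__main__":
    for outstanding in (5, 10, 11, 20):
        print(outstanding, sender_will_hang(outstanding))
```

This is why the same configuration can survive one EB kill and hang on the next: the outcome depends on the routing table contents at the instant of the crash.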
This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/23050 (FNAL account required).
Originally created by @bieryAtFnal on 2019-08-02 18:28:04
The magic number seems to be 10. Systems with buffer counts larger than this run into problems when an EB dies. Systems with buffer counts of 10 or lower continue running.
To reproduce this issue:
Then use a command like the following one to start a run:
pwd
/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 300 --partition 0 --no_om
After the run gets going, look in/daqdata to see that the data file is growing over time.
Kill one of the event builders and check if data is still flowing through the remaining event builders (for example, the data file is still growing).
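The "is the data file still growing" check can be scripted rather than eyeballed. The following is a minimal sketch that polls a file's size; `file_is_growing` is a hypothetical helper, and the path to pass in is whatever data file the run is writing (no specific file name is assumed here).

```python
import os
import time


def file_is_growing(path, interval_s=5.0, checks=3):
    """Return True if the file at `path` grows during any of `checks` poll intervals."""
    last_size = os.path.getsize(path)
    for _ in range(checks):
        time.sleep(interval_s)
        size = os.path.getsize(path)
        if size > last_size:
            return True
        last_size = size
    return False
```

Running this once before and once after killing an event builder gives a simple yes/no answer on whether data is still flowing to that file.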