art-daq / artdaq


The number of EB buffers seems to affect whether a demo system continues to take data after an EB crash #145

Closed: eflumerf closed this issue 2 years ago

eflumerf commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/23050 (FNAL account required). Originally created by @bieryAtFnal on 2019-08-02 18:28:04.


The magic number seems to be 10: systems with buffer counts larger than this run into problems, while systems with buffer counts lower than 10 continue running when an EB dies.

To reproduce this issue:

Then use a command like the following one to start a run:

After the run gets going, look in /daqdata to see that the data file is growing over time.

Kill one of the event builders and check if data is still flowing through the remaining event builders (for example, the data file is still growing).

With 5 EBs, this clearly showed a difference between 9 and 11 buffers per EB. With 3 EBs, such as what we get with the default mediumsystem_with_routing_master config, I'm not seeing a problem with 20 buffers. So, it needs more investigation.

----

***Related issues:***

- https://github.com/art-daq/artdaq/issues/145
eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:38:14


I believe that this is related to running with Autodetect or Shmem transfer mode.

Shmem_transfer gets its default value for the number of buffers to use from the TransferInterface base class, and that default value is 10.
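For illustration, this is roughly the lookup pattern being described, written as a hedged sketch using a fhiclcpp-style ParameterSet. The helper name is hypothetical and not artdaq code; the "buffers" key is the parameter mentioned later in this thread, and only the default of 10 comes from the observation above.

```cpp
#include <cstddef>

#include "fhiclcpp/ParameterSet.h"

// Hypothetical helper, not artdaq code: if the generated destination block
// has no "buffers" entry, the lookup silently falls back to the assumed
// base-class default of 10, no matter how many buffers the EB was given.
std::size_t configured_buffer_count(fhicl::ParameterSet const& pset)
{
  return pset.get<std::size_t>("buffers", 10);
}
```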

So, in a mediumsystem_with_routing_master test run with EB buffers set to 20, when an EB is killed, component01 thinks that there are 20 buffers available in each of the EBs (including the one that was killed), and it dutifully tries to send 20. The dead EB may indeed have had 20 buffers inside it, but the Shmem_transfer that the BoardReader uses to interface with it only has 10 buffers. So, the attempt to send the 11th event after the crash hangs forever.
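As a minimal, standalone C++ analogy of that failure mode (not artdaq code, just a toy): the sender believes 20 slots exist, the underlying buffer only has 10, and nobody is reading after the crash, so the blocking write of the 11th event never returns.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <mutex>

// Toy stand-in for the shared-memory segment between a BoardReader and one
// EventBuilder. Capacity is 10, matching the assumed TransferInterface
// default, regardless of what the EB itself was configured with.
class BoundedBuffer {
public:
  explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Blocks until a slot frees up -- forever, if the reader has died.
  void blocking_write(int event)
  {
    std::unique_lock<std::mutex> lk(m_);
    not_full_.wait(lk, [this] { return q_.size() < capacity_; });
    q_.push_back(event);
  }

private:
  std::size_t capacity_;
  std::deque<int> q_;
  std::mutex m_;
  std::condition_variable not_full_;
};

int main()
{
  BoundedBuffer shm(10);          // the transfer really has 10 slots
  const int believed_slots = 20;  // the sender was told 20 buffers exist

  // With no reader draining the buffer (the EB has been killed), the first
  // 10 writes succeed and the 11th hangs forever, as described above.
  for (int ev = 1; ev <= believed_slots; ++ev) {
    std::printf("sending event %d\n", ev);
    shm.blocking_write(ev);
  }
  return 0;
}
```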

It seems like Shmem_transfer either needs to know about the health of the reading process or maybe have a timeout on the write.
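As a hedged sketch of the second option (a timeout on the write), here is a standalone variant of the toy buffer above: wait_for abandons the wait after a deadline so the sender can report the destination as unresponsive instead of hanging. The names and the 5-second deadline are illustrative, not artdaq's.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <mutex>

// Toy stand-in for the shared-memory slots between sender and receiver.
struct Slots {
  std::size_t capacity = 10;
  std::deque<int> q;
  std::mutex m;
  std::condition_variable not_full;
};

// Returns false if no slot frees up before the deadline, so the caller can
// treat the reading process as dead instead of blocking forever.
bool timed_write(Slots& s, int event, std::chrono::milliseconds deadline)
{
  std::unique_lock<std::mutex> lk(s.m);
  if (!s.not_full.wait_for(lk, deadline,
                           [&] { return s.q.size() < s.capacity; })) {
    return false;
  }
  s.q.push_back(event);
  return true;
}

int main()
{
  Slots shm;  // nobody is reading, as after an EB crash
  for (int ev = 1; ev <= 20; ++ev) {
    if (!timed_write(shm, ev, std::chrono::seconds(5))) {
      std::printf("event %d: destination unresponsive, giving up\n", ev);
      break;  // real code could mark this destination as down here
    }
  }
  return 0;
}
```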

This issue may be related to a different one (that I've been trying to document for some time) in which an EventBuilder that is receiving fragments via the Shmem_transfer can't be restarted. I haven't filed that Issue yet, and I will at some point. In the meantime, there are hints of that problem in the tests that I describe on 01-Aug in Redmine Issue 21621.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:42:53


Another bit of trivia is that we can't seem to set the number of buffers used by the Shmem_transfer. Or rather, it isn't clear how to do that. TransferInterface accepts a parameter named "buffers", but the contents of the destination blocks in our configuration FCL files are completely book-kept, and it's not clear how to tell DAQInterface what buffer count to use when it generates all of the destination config entries.

eflumerf commented 2 years ago

Comment by @bieryAtFnal on 2019-08-07 21:54:55


Of course, there is an element of chance in provoking the problem. The crucial number is the number of buffers in the routing table when the EB dies (not the configured number of buffers in the EB). If that number is larger than 10, the system will run into trouble; if it is smaller than or equal to 10, the system will keep running.