DUNE-DAQ / iomanager

Package providing a unified API

Possible flushing of connections between runs #16

Closed: mroda88 closed this issue 1 year ago

mroda88 commented 2 years ago

While testing version 3.0.0 with the following large-scale system configuration

daqconf_multiru_gen  -n 8 -s 10 -b 20000 -a 30000 -o /data3/test --disable-trace \
                     --host-df np04-srv-001 \
                     --host-df np04-srv-002 \
                     --host-df np04-srv-003 \
                     --host-df np04-srv-004 \
                     --host-dfo np04-srv-001 \
                     --host-trigger np04-srv-019 \
                     --host-hsi np04-srv-012 \
                     --host-ru np04-srv-011 --region-id 0 \
                     --host-ru np04-srv-012 --region-id 1 \
                     --host-ru np04-srv-013 --region-id 2 \
                     --host-ru np04-srv-014 --region-id 3 \
                     --host-ru np04-srv-015 --region-id 4 \
                     --host-ru np04-srv-016 --region-id 5 \
                     --host-ru np04-srv-019 --region-id 6 \
                     --host-ru np04-srv-021 --region-id 7 \
                     --host-ru np04-srv-022 --region-id 8 \
                     --host-ru np04-srv-023 --region-id 9 \
                     --opmon-impl cern --ers-impl cern \
                     --enable-software-tpg --host-tpw np04-srv-002 --enable-tpset-writing \
                     --enable-dqm --dqm-impl cern \
                     large_scale_system

we noticed that some fragments were missing because the stop transition takes quite some time (1 or 2 minutes). The fragments were not completely lost: they were received by the TRB in the following run.

In the discussion we decided to take note of this observation, hence this issue, so that it can possibly be addressed in future releases. One possible solution might be a forced flush of the connections done at start, although we are aware that this is a complicated issue and it might not work.
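
To make the idea concrete, below is a minimal sketch of what a flush-at-start could look like, assuming a generic receiver that offers a non-blocking try_receive(). The names ToyReceiver, Fragment and drain_stale_fragments are illustrative placeholders only, not part of the iomanager API.

// Illustrative sketch only: ToyReceiver stands in for whatever receiver type
// the application uses; it is NOT the actual iomanager interface.
#include <cstddef>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

struct Fragment {
  int run_number;
  std::string payload;
};

class ToyReceiver {
public:
  // Non-blocking receive: returns a fragment if one is queued, nullopt otherwise.
  std::optional<Fragment> try_receive() {
    std::lock_guard<std::mutex> lk(m_mutex);
    if (m_queue.empty()) return std::nullopt;
    Fragment f = m_queue.front();
    m_queue.pop();
    return f;
  }
  void push(Fragment f) {
    std::lock_guard<std::mutex> lk(m_mutex);
    m_queue.push(std::move(f));
  }
private:
  std::mutex m_mutex;
  std::queue<Fragment> m_queue;
};

// Drain anything left over from the previous run before accepting new data.
// Would be called from the start transition; returns the number of stale
// fragments dropped so they can be logged rather than silently discarded.
std::size_t drain_stale_fragments(ToyReceiver& receiver) {
  std::size_t n_dropped = 0;
  while (auto frag = receiver.try_receive()) {
    ++n_dropped;
  }
  return n_dropped;
}

int main() {
  ToyReceiver receiver;
  receiver.push({41, "late fragment from previous run"});
  std::cout << "Flushed " << drain_stale_fragments(receiver)
            << " stale fragment(s) at start\n";
}

In a real start transition the drained count would presumably be logged, so that fragments left over from the previous run remain traceable instead of reappearing in the next run's data.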

eflumerf commented 2 years ago

This may be a duplicate of #18

eflumerf commented 2 years ago

@mroda88 Can you test this again? It may have been resolved by #22 and #20

wesketchum commented 1 year ago

Talking to @mroda88, we're not sure if this is fully resolved, but we will take on this issue and plan to test with v3.2 at protodune-hd with 4 APAs. If that passes and shows no issues, then we'll close this, and we can always reopen if something comes up with larger-scale tests in the future.

That doesn't preclude rerunning this test as originally written, though there is some question as to whether that would still be reasonable anyway.

mroda88 commented 1 year ago

With recent tests of 4.1.x this problem did not show up anymore, so this issue is no longer relevant.