Closed bieryAtFnal closed 1 year ago
It looks like the problem is coming from the WIB2TPHandler as part of the SWTPG. I will have a look and try to debug the issue.
By default, software_tpg_threshold is set to 100. If this value is too low, the number of TPs produced may be too large for the receiving subsystem to handle. I have not seen that error from the TPHandler with the following config file (note that I turned off the connectivity service because it was not working for me in this simple setup):
{
  "boot": {
    "use_connectivity_service": false,
    "start_connectivity_service": false
  },
  "dataflow": {
    "apps": [
      { "app_name": "dataflow0" },
      { "app_name": "dataflow1" }
    ]
  },
  "readout": {
    "enable_software_tpg": true,
    "software_tpg_threshold": 500,
    "clock_speed_hz": 62500000,
    "data_rate_slowdown_factor": 10,
    "data_files": [
      {"detector_id": 3, "data_file": "asset://?label=DuneWIB&subsystem=readout"}
    ]
  },
  "trigger": {
    "enable_tpset_writing": true,
    "trigger_activity_config": {"prescale": 1000},
    "trigger_window_before_ticks": 1000,
    "trigger_window_after_ticks": 1000,
    "trigger_rate_hz": 1.0
  }
}
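To illustrate why raising the threshold helps, here is a hypothetical sketch (not the actual SWTPG code) of a naive hit finder that emits one TP per contiguous run of ADC samples above the threshold. Raising software_tpg_threshold directly shrinks the number of TPs sent downstream, which is why 500 may avoid flooding the receiver while 100 does not. The function name and logic are illustrative only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: count "TPs" as contiguous runs of samples above
// the threshold. The real SWTPG hit finder is more sophisticated, but
// the threshold has the same qualitative effect on the TP rate.
int count_tps(const std::vector<int16_t>& adc, int16_t threshold)
{
  int tps = 0;
  bool in_hit = false;
  for (int16_t sample : adc) {
    if (sample > threshold && !in_hit) {
      ++tps;          // rising edge of a new hit
      in_hit = true;
    } else if (sample <= threshold) {
      in_hit = false; // hit (if any) has ended
    }
  }
  return tps;
}
```

With a waveform containing one small pulse (peak ~150 ADC) and one large pulse (peak ~600 ADC), a threshold of 100 yields two TPs while a threshold of 500 yields one.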
I also looked into the output tpstream file to check whether the values for channels, timestamps, and ADCs looked reasonable, and they seem to be. Let me know if this works for you as well.
Thanks, Adam. I've confirmed that the higher software_tpg_threshold eliminates those error messages.
Independent of that, I'd like to ask your advice on a different issue that happens when I stop and start multiple runs in the same DAQ session, using the daqconf.json file that you sent.
When I do that, I see messages like the following:
WARNING [void dunedaq::fdreadoutlibs::WIB2TPHandler::try_sending_tpsets(uint64_t) at /home/nfs/dunedaq/daqsw/04AprV4.0.0rc1Testing/sourcecode/fdreadoutlibs/include/fdreadoutlibs/wib2/WIB2TPHandler.hpp:96] Continuity of timestamps broken.
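My guess is that this warning comes from a monotonicity check on the TPSet timestamps. Here is a hypothetical sketch of such a check (not the actual WIB2TPHandler code; the class and member names are made up): if the last-seen timestamp survives across a stop/start cycle without being reset, the first TPSet of the next run would trip it.

```cpp
#include <cstdint>

// Hypothetical sketch of a timestamp-continuity check. A stale
// m_last_ts carried over from a previous run would make the first
// timestamp of the new run look like it went backwards.
class TimestampChecker
{
  uint64_t m_last_ts = 0;

public:
  // Returns true if the timestamp is consistent with the previous one.
  bool check(uint64_t ts)
  {
    bool ok = (m_last_ts == 0) || (ts >= m_last_ts);
    m_last_ts = ts;
    return ok;
  }

  // Would need to be called between runs to avoid spurious warnings.
  void reset() { m_last_ts = 0; }
};
```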
I used a command like the following for this latest test:
nanorc --partition-number 3 mdapp_adam/ ${USER}-test boot conf start_run 1111 wait 20 stop_run wait 2 start_run 1112 wait 20 stop_run wait 2 start_run 1113 wait 20 stop_run scrap terminate
The warning messages seem to appear after the first run has stopped, and they continue throughout the second and third runs.
Any ideas? Thanks
That warning message originates from the fact that during the coldbox runs we noticed that TPSets were arriving at the trigger out of order. The fix for that problem was to increase the wait time before producing TPSets and to drop any TPSets older than a certain age. We added that warning so we would know when this condition was occurring. Having said that, the question here is what is happening after you stop the (first) run, which probably needs some investigation.
I also agree that in the future it would be best to increment a counter of dropped TPSets when we fall into that condition.
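The drop-old-TPSets fix plus the dropped counter can be sketched roughly as follows. This is a hypothetical illustration, not the actual fdreadoutlibs code; the class name, member names, and the margin value are all made up.

```cpp
#include <cstdint>

// Hypothetical sketch: forward a TPSet only if its timestamp is no more
// than max_age ticks behind the newest timestamp seen so far, and keep
// a counter of how many TPSets were dropped as too old.
class TPSetFilter
{
  uint64_t m_newest_ts = 0;
  uint64_t m_max_age;
  uint64_t m_num_dropped = 0;

public:
  explicit TPSetFilter(uint64_t max_age) : m_max_age(max_age) {}

  // Returns true if the TPSet should be forwarded, false if dropped.
  bool accept(uint64_t ts)
  {
    if (ts > m_newest_ts) {
      m_newest_ts = ts;  // in-order: advance the high-water mark
      return true;
    }
    if (m_newest_ts - ts <= m_max_age) {
      return true;       // late, but within the allowed margin
    }
    ++m_num_dropped;     // too old: drop and count it
    return false;
  }

  uint64_t num_dropped() const { return m_num_dropped; }
};
```

With a margin of 100 ticks: a TPSet 50 ticks behind the newest is still forwarded, while one 200 ticks behind is dropped and counted.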
I'm not sure whether I should file this issue in this repo (fdreadoutlibs) or in the daqconf repo, but what I observe is complaints in the RU log files about a failure to write to the m_tp_sink queue when using WIB2 emulated data.
Here is the hw_map.txt file that I'm using:
Here is the daqconf.json file that I'm using:
Here are the steps that I used to demonstrate the problem:
The logfile grep shows messages like the following: