DUNE-DAQ / iomanager

Package providing a unified API

DFApp1 does not get TriggerDecisions when there are 16 DFApps in use #71

Closed bieryAtFnal closed 8 months ago

bieryAtFnal commented 9 months ago

I'm not sure if the behavior that I will describe here is a bug or a feature, but I want to document what has been seen so that we have a record. I also don't know whether this Issue should be filed in the iomanager repo or ConnectivityService repo, so I'm just picking one.

In the high-rate data taking for the offline data challenge in November 2023, sixteen data loggers were used. (The more accurate way to say that is '16 Dataflow Apps were used, and each had a single DataWriter module inside of it.') Each of the DataWriters was configured to use a different storage disk (4 disks on each of 4 servers).

In those runs, there were problems with the data writing on the DF app named "dataflow1". That was the TRB/DataWriter configured to write data to np04-srv-001:/data1. One observed aspect of the problem was that the DFO could not send TriggerDecisions to the TRB, as evidenced by error messages in the DFO log file. There may have been other warning or error messages in the logs, but the overall effect was that no data was written to disk on srv-001:/data1.

I believe that the observed failures were caused by problems with connection name resolution when connection UIDs ending in "1" and in "10", "11", ..., "15" were present in the same DAQ system configuration.
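To make the suspected collision concrete, here is a minimal, self-contained Python sketch (not the actual ConnectivityService code; the list of UIDs and the use of a partial regex match are assumptions based on the behavior described later in this thread) showing how a lookup of "trigger_decision_1" could return several connection UIDs when sixteen Dataflow apps are present:

import re

# Hypothetical set of connection UIDs, one per Dataflow app (0..15).
uids = [f"trigger_decision_{i}" for i in range(16)]

# If the requested name is treated as a regex and matched partially,
# "trigger_decision_1" also hits _10 through _15.
print([u for u in uids if re.search("trigger_decision_1", u)])
# -> ['trigger_decision_1', 'trigger_decision_10', ..., 'trigger_decision_15']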

Here is some console output that helps illustrate what happened.

Here are some examples of the observed DFO error messages:

[biery@np04-srv-024 log]$ egrep 'ERR|WARN|WRN' log*dfo*.txt | grep -v 'Connection refus' | grep 'Send to connection' | grep failed | head -12
log_2023-11-13_141222_dfo_6435.txt:2023-Nov-13 14:14:07,667 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_141222_dfo_6435.txt:2023-Nov-13 14:14:07,772 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_141222_dfo_6435.txt:2023-Nov-13 14:14:07,876 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_141222_dfo_6435.txt:2023-Nov-13 14:14:07,978 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_141222_dfo_6435.txt:2023-Nov-13 14:14:08,081 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_153445_dfo_6435.txt:2023-Nov-13 15:36:16,362 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_153445_dfo_6435.txt:2023-Nov-13 15:36:16,467 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_153445_dfo_6435.txt:2023-Nov-13 15:36:16,569 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_153445_dfo_6435.txt:2023-Nov-13 15:36:16,671 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-13_153445_dfo_6435.txt:2023-Nov-13 15:36:16,773 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-14_105001_dfo_8707.txt:2023-Nov-14 10:52:12,743 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed
log_2023-11-14_105001_dfo_8707.txt:2023-Nov-14 10:52:12,848 WARNING [bool dunedaq::dfmodules::DataFlowOrchestrator::dispatch(const std::shared_ptr<dunedaq::dfmodules::AssignedTriggerDecision>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.12.1-u42hrx655ufengjresazk2wticwwcfqj/spack-src/plugins/DataFlowOrchestrator.cpp:454] Send to connection "trigger_decision_1" failed

Here is a list of the DFO log files in which the reported errors appeared:

[biery@np04-srv-024 log]$ egrep 'ERR|WARN|WRN' log*dfo*.txt | grep -v 'Connection refus' | grep 'Send to connection' | grep failed | cutsort
log_2023-11-13_141222_dfo_6435.txt
log_2023-11-13_153445_dfo_6435.txt
log_2023-11-14_105001_dfo_8707.txt
log_2023-11-14_105929_dfo_8657.txt
log_2023-11-14_172843_dfo_8657.txt
log_2023-11-14_173426_dfo_8657.txt
log_2023-11-14_202342_dfo_8657.txt
log_2023-11-21_113248_dfo_8707.txt
log_2023-11-21_114430_dfo_8707.txt

Here is a list of the DFO log files that mention a Dataflow App with number 15:

[biery@np04-srv-024 log]$ grep trigger_decision_15 *dfo*.txt | cutsort
log_2023-11-13_141222_dfo_6435.txt
log_2023-11-13_153445_dfo_6435.txt
log_2023-11-14_105001_dfo_8707.txt
log_2023-11-14_105929_dfo_8657.txt
log_2023-11-14_172843_dfo_8657.txt
log_2023-11-14_173426_dfo_8657.txt
log_2023-11-14_202342_dfo_8657.txt
log_2023-11-21_113248_dfo_8707.txt
log_2023-11-21_114430_dfo_8707.txt

Here is a list of the DFO log files that mention a Dataflow App with number 3:

[biery@np04-srv-024 log]$ grep trigger_decision_3 *dfo*.txt | cutsort
log_2023-11-07_120415_dfo_5657.txt
log_2023-11-07_122053_dfo_5657.txt
log_2023-11-07_125301_dfo_10157.txt
log_2023-11-07_130159_dfo_10157.txt
log_2023-11-13_121555_dfo_6435.txt
log_2023-11-13_123341_dfo_6435.txt
log_2023-11-13_141222_dfo_6435.txt
log_2023-11-13_153445_dfo_6435.txt
log_2023-11-14_105001_dfo_8707.txt
log_2023-11-14_105929_dfo_8657.txt
log_2023-11-14_172843_dfo_8657.txt
log_2023-11-14_173426_dfo_8657.txt
log_2023-11-14_202342_dfo_8657.txt
log_2023-11-21_113248_dfo_8707.txt
log_2023-11-21_114430_dfo_8707.txt

The point of those lists of log files was to demonstrate that the runs with 16 TRBs had the problem with DFApp number 1, but the runs with only 4 TRBs did not.

Next, I'll post some sample instructions to help demonstrate the problem in emulated-data systems.

bieryAtFnal commented 9 months ago

Here are sample instructions for demonstrating the problem.

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -c -n NAFD24-01-02 02JanFDDev16DataWriterTest
cd 02JanFDDev16DataWriterTest/sourcecode
git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b develop
cd ..

dbt-workarea-env
dbt-build -j 20
dbt-workarea-env

mkdir rundir
cd rundir

cat <<EOF1 > daqconf1.json
{
  "boot": {
    "use_connectivity_service": true,
    "start_connectivity_service": true,
    "connectivity_service_host": "localhost",
    "connectivity_service_port": 15432
  }, 
  "daq_common": {
    "data_rate_slowdown_factor": 1
  },
  "detector": {
    "clock_speed_hz": 62500000
  },
  "readout": {
    "use_fake_cards": true,
    "default_data_file": "asset://?label=WIBEth&subsystem=readout"
  },
  "trigger": {
    "trigger_window_before_ticks": 1000,
    "trigger_window_after_ticks": 1000
  },
  "dataflow": {
      "apps":
      [
          { "app_name": "dataflow0" },
          { "app_name": "dataflow1" },
          { "app_name": "dataflow2" },
          { "app_name": "dataflow3" },
          { "app_name": "dataflow4" },
          { "app_name": "dataflow5" },
          { "app_name": "dataflow6" },
          { "app_name": "dataflow7" },
          { "app_name": "dataflow8" },
          { "app_name": "dataflow9" },
          { "app_name": "dataflow10" },
          { "app_name": "dataflow11" },
          { "app_name": "dataflow12" },
          { "app_name": "dataflow13" },
          { "app_name": "dataflow14" },
          { "app_name": "dataflow15" }
      ]
  },
  "hsi": {
    "random_trigger_rate_hz": 8.0
  }
}
EOF1

cat <<EOF2 > dro_map.json
[
    {
        "src_id": 100,
        "geo_id": {
            "det_id": 3,
            "crate_id": 1,
            "slot_id": 0,
            "stream_id": 0
        },
        "kind": "eth",
        "parameters": {
            "protocol": "udp",
            "mode": "fix_rate",
            "rx_iface": 0,
            "rx_host": "localhost",
            "rx_mac": "00:00:00:00:00:00",
            "rx_ip": "0.0.0.0",
            "tx_host": "localhost",
            "tx_mac": "00:00:00:00:00:00",
            "tx_ip": "0.0.0.0"
        }
    },
    {
        "src_id": 101,
        "geo_id": {
            "det_id": 3,
            "crate_id": 1,
            "slot_id": 0,
            "stream_id": 1
        },
        "kind": "eth",
        "parameters": {
            "protocol": "udp",
            "mode": "fix_rate",
            "rx_iface": 0,
            "rx_host": "localhost",
            "rx_mac": "00:00:00:00:00:00",
            "rx_ip": "0.0.0.0",
            "tx_host": "localhost",
            "tx_mac": "00:00:00:00:00:00",
            "tx_ip": "0.0.0.0"
        }
    }
]
EOF2

fddaqconf_gen -c ./daqconf1.json --detector-readout-map-file ./dro_map.json mdapp_wibeth_16df

nanorc --partition-number 2 mdapp_wibeth_16df ${USER}-test boot conf start_run 101 wait 20 stop_run scrap terminate

egrep 'ERROR|WARNING' log_*.txt | grep -v 'Sequence ID continuity' | grep -v chunk_size

ls -alF *.hdf5

ls -alF *.hdf5 | wc -l

mkdir -p backup

cat <<EOF3 > daqconf2.json
{
  "boot": {
    "use_connectivity_service": true,
    "start_connectivity_service": true,
    "connectivity_service_host": "localhost",
    "connectivity_service_port": 15432
  }, 
  "daq_common": {
    "data_rate_slowdown_factor": 1
  },
  "detector": {
    "clock_speed_hz": 62500000
  },
  "readout": {
    "use_fake_cards": true,
    "default_data_file": "asset://?label=WIBEth&subsystem=readout"
  },
  "trigger": {
    "trigger_window_before_ticks": 1000,
    "trigger_window_after_ticks": 1000
  },
  "dataflow": {
      "apps":
      [
          { "app_name": "dataflow00" },
          { "app_name": "dataflow01" },
          { "app_name": "dataflow02" },
          { "app_name": "dataflow03" },
          { "app_name": "dataflow04" },
          { "app_name": "dataflow05" },
          { "app_name": "dataflow06" },
          { "app_name": "dataflow07" },
          { "app_name": "dataflow08" },
          { "app_name": "dataflow09" },
          { "app_name": "dataflow10" },
          { "app_name": "dataflow11" },
          { "app_name": "dataflow12" },
          { "app_name": "dataflow13" },
          { "app_name": "dataflow14" },
          { "app_name": "dataflow15" }
      ]
  },
  "hsi": {
    "random_trigger_rate_hz": 8.0
  }
}
EOF3

# zero-padded Dataflow App numbers
fddaqconf_gen -c ./daqconf2.json --detector-readout-map-file ./dro_map.json mdapp_wibeth_16df_zp

nanorc --partition-number 2 mdapp_wibeth_16df_zp ${USER}-test boot conf start_run 102 wait 20 stop_run scrap terminate

egrep 'ERROR|WARNING' log_*.txt | grep -v 'Sequence ID continuity' | grep -v chunk_size

ls -alF *.hdf5

ls -alF *.hdf5 | wc -l

It is worth noting that the problems do not completely go away when we zero-pad the dataflow app number.

bieryAtFnal commented 9 months ago

In addition to having a fix for this problem, or a list of instructions for how to avoid it, it would be great to have a test application that checks for the failure and verifies that it no longer happens once any code changes or special instructions are in place.

I will look into an integtest along these lines, but it would also be great to have something in iomanager/test/apps or a similar area.
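As a starting point for such a test, here is a purely illustrative sketch; the resolve_uids function and the in-memory registry are hypothetical stand-ins, not the real connectivityserver API. It checks that, with 16 Dataflow apps registered, an exact UID request resolves to exactly one connection:

import re

def resolve_uids(pattern, registry):
    # Hypothetical resolver that only accepts full-string matches.
    return [uid for uid in registry if re.fullmatch(pattern, uid)]

def test_exact_uid_resolves_to_single_connection():
    registry = [f"trigger_decision_{i}" for i in range(16)]
    for i in range(16):
        matches = resolve_uids(f"trigger_decision_{i}", registry)
        assert matches == [f"trigger_decision_{i}"], matches

if __name__ == "__main__":
    test_exact_uid_resolves_to_single_connection()
    print("exact-match resolution OK for 16 Dataflow apps")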

gcrone commented 9 months ago

OK, I can see what the problem is. The regular-expression matching in the connectivity server matches a partial string rather than the whole string, so xx_1 and xx_11 both match. (It doesn't matter what you call your dataflow apps; the connections are named 'trigger_decision_0' through 'trigger_decision_N-1', and that is where the clash between _1 and _11 occurs.) I think this can be fixed in a couple of ways. The simplest is to change regex.search to regex.fullmatch in the connectivity server. I assume that any code that wants to match multiple UIDs already includes .* at the end of its regex, so this shouldn't break anything else.
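To illustrate the proposed change, here is a small sketch using Python's re module (illustrative only, not the actual connectivityserver code): fullmatch only accepts a pattern that covers the whole UID, and callers that really do want a family of UIDs can make the wildcard explicit.

import re

uids = [f"trigger_decision_{i}" for i in range(16)]

# With fullmatch, only the exact UID is returned.
print([u for u in uids if re.fullmatch("trigger_decision_1", u)])
# -> ['trigger_decision_1']

# Code that intends to match multiple UIDs keeps working by appending .*
# to the end of its pattern.
print([u for u in uids if re.fullmatch("trigger_decision_1.*", u)])
# -> ['trigger_decision_1', 'trigger_decision_10', ..., 'trigger_decision_15']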

gcrone commented 9 months ago

To test the update to connectivityserver, in addition to the instructions above, you can:

git clone https://github.com/DUNE-DAQ/connectivityserver -b gcrone/regexFix
pip install ./connectivityserver

bieryAtFnal commented 8 months ago

Closing this Issue since it has been addressed by PR 5 in the connectivityserver repo.