Closed bieryAtFnal closed 8 months ago
Here are sample instructions for demonstrating the problem.
source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -c -n NAFD24-01-02 02JanFDDev16DataWriterTest
cd 02JanFDDev16DataWriterTest/sourcecode
git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b develop
cd ..
dbt-workarea-env
dbt-build -j 20
dbt-workarea-env
mkdir rundir
cd rundir
cat <<EOF1 > daqconf1.json
{
"boot": {
"use_connectivity_service": true,
"start_connectivity_service": true,
"connectivity_service_host": "localhost",
"connectivity_service_port": 15432
},
"daq_common": {
"data_rate_slowdown_factor": 1
},
"detector": {
"clock_speed_hz": 62500000
},
"readout": {
"use_fake_cards": true,
"default_data_file": "asset://?label=WIBEth&subsystem=readout"
},
"trigger": {
"trigger_window_before_ticks": 1000,
"trigger_window_after_ticks": 1000
},
"dataflow": {
"apps":
[
{ "app_name": "dataflow0" },
{ "app_name": "dataflow1" },
{ "app_name": "dataflow2" },
{ "app_name": "dataflow3" },
{ "app_name": "dataflow4" },
{ "app_name": "dataflow5" },
{ "app_name": "dataflow6" },
{ "app_name": "dataflow7" },
{ "app_name": "dataflow8" },
{ "app_name": "dataflow9" },
{ "app_name": "dataflow10" },
{ "app_name": "dataflow11" },
{ "app_name": "dataflow12" },
{ "app_name": "dataflow13" },
{ "app_name": "dataflow14" },
{ "app_name": "dataflow15" }
]
},
"hsi": {
"random_trigger_rate_hz": 8.0
}
}
EOF1
cat <<EOF2 > dro_map.json
[
{
"src_id": 100,
"geo_id": {
"det_id": 3,
"crate_id": 1,
"slot_id": 0,
"stream_id": 0
},
"kind": "eth",
"parameters": {
"protocol": "udp",
"mode": "fix_rate",
"rx_iface": 0,
"rx_host": "localhost",
"rx_mac": "00:00:00:00:00:00",
"rx_ip": "0.0.0.0",
"tx_host": "localhost",
"tx_mac": "00:00:00:00:00:00",
"tx_ip": "0.0.0.0"
}
},
{
"src_id": 101,
"geo_id": {
"det_id": 3,
"crate_id": 1,
"slot_id": 0,
"stream_id": 1
},
"kind": "eth",
"parameters": {
"protocol": "udp",
"mode": "fix_rate",
"rx_iface": 0,
"rx_host": "localhost",
"rx_mac": "00:00:00:00:00:00",
"rx_ip": "0.0.0.0",
"tx_host": "localhost",
"tx_mac": "00:00:00:00:00:00",
"tx_ip": "0.0.0.0"
}
}
]
EOF2
fddaqconf_gen -c ./daqconf1.json --detector-readout-map-file ./dro_map.json mdapp_wibeth_16df
nanorc --partition-number 2 mdapp_wibeth_16df ${USER}-test boot conf start_run 101 wait 20 stop_run scrap terminate
egrep 'ERROR|WARNING' log_*.txt | grep -v 'Sequence ID continuity' | grep -v chunk_size
ls -alF *.hdf5
ls -alF *.hdf5 | wc -l
mkdir -p backup
cat <<EOF3 > daqconf2.json
{
"boot": {
"use_connectivity_service": true,
"start_connectivity_service": true,
"connectivity_service_host": "localhost",
"connectivity_service_port": 15432
},
"daq_common": {
"data_rate_slowdown_factor": 1
},
"detector": {
"clock_speed_hz": 62500000
},
"readout": {
"use_fake_cards": true,
"default_data_file": "asset://?label=WIBEth&subsystem=readout"
},
"trigger": {
"trigger_window_before_ticks": 1000,
"trigger_window_after_ticks": 1000
},
"dataflow": {
"apps":
[
{ "app_name": "dataflow00" },
{ "app_name": "dataflow01" },
{ "app_name": "dataflow02" },
{ "app_name": "dataflow03" },
{ "app_name": "dataflow04" },
{ "app_name": "dataflow05" },
{ "app_name": "dataflow06" },
{ "app_name": "dataflow07" },
{ "app_name": "dataflow08" },
{ "app_name": "dataflow09" },
{ "app_name": "dataflow10" },
{ "app_name": "dataflow11" },
{ "app_name": "dataflow12" },
{ "app_name": "dataflow13" },
{ "app_name": "dataflow14" },
{ "app_name": "dataflow15" }
]
},
"hsi": {
"random_trigger_rate_hz": 8.0
}
}
EOF3
# zero-padded Dataflow App numbers
fddaqconf_gen -c ./daqconf2.json --detector-readout-map-file ./dro_map.json mdapp_wibeth_16df_zp
nanorc --partition-number 2 mdapp_wibeth_16df_zp ${USER}-test boot conf start_run 102 wait 20 stop_run scrap terminate
egrep 'ERROR|WARNING' log_*.txt | grep -v 'Sequence ID continuity' | grep -v chunk_size
ls -alF *.hdf5
ls -alF *.hdf5 | wc -l
It is worth noting that the problems do not completely go away when we zero-pad the dataflow app number.
In addition to having a fix for this problem, or a list of instructions for how to avoid it, it would be great to have a test application that checks for the failure and verifies that it no longer happens when any code changes or special instructions are used.
I will look into an integtest along these lines, but it would also be great to have something in iomanager/test/apps or a similar area.
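A test along those lines could start from a simple ambiguity check on the connection UIDs. The sketch below is a standalone illustration, not code from iomanager or the connectivity server; the trigger_decision_N naming follows the description in this Issue, and the partial-match behavior is modeled with Python's re.search:

```python
import re

def ambiguous_uids(uids):
    """Return (a, b) pairs where UID a, used as a regex with
    re.search, also matches a different UID b (a partial match)."""
    clashes = []
    for a in uids:
        pattern = re.compile(a)
        for b in uids:
            if a != b and pattern.search(b):
                clashes.append((a, b))
    return clashes

# 16 unpadded connection names: trigger_decision_1 also matches _10.._15
unpadded = [f"trigger_decision_{i}" for i in range(16)]
print(ambiguous_uids(unpadded))

# zero-padded names avoid this particular clash
padded = [f"trigger_decision_{i:02d}" for i in range(16)]
print(ambiguous_uids(padded))
```

A check like this could be run against the full set of UIDs generated by a configuration, before the system is booted.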
OK, I can see what the problem is. It is the regular-expression matching in the connectivity server matching a partial string rather than the whole string, so xx_1 and xx_11 both match. (It doesn't matter what you call your dataflow apps; the connections get named 'trigger_decision_0' through 'trigger_decision_N-1', and it is there that the clash between _1 and _11 occurs.)
I think this can be fixed in a couple of ways. The simplest is to change regex.search to regex.fullmatch in the connectivity server. I assume that any code that wants to match multiple UIDs will include .* at the end of its regex, so this shouldn't break anything else.
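As a standalone illustration of that difference (assuming the server uses Python re-style matching, per the regex.search / regex.fullmatch names above; this is not the actual server code):

```python
import re

# connection UIDs as described above, for a 16-Dataflow-App system
uids = [f"trigger_decision_{i}" for i in range(16)]
pattern = re.compile("trigger_decision_1")

# search matches a substring, so _1 also picks up _10 through _15
search_hits = [u for u in uids if pattern.search(u)]
print(search_hits)    # 7 hits: _1 plus _10.._15

# fullmatch requires the entire UID to match, removing the ambiguity
fullmatch_hits = [u for u in uids if pattern.fullmatch(u)]
print(fullmatch_hits)  # only trigger_decision_1

# a client that really wants multiple UIDs can still append .*
multi = [u for u in uids if re.fullmatch("trigger_decision_1.*", u)]
print(multi)           # same 7 hits as search gave
```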
To test the update to connectivityserver, in addition to the instructions above, you can:
git clone https://github.com/DUNE-DAQ/connectivityserver -b gcrone/regexFix
pip install ./connectivityserver
Closing this Issue since it has been addressed by PR 5 in the connectivityserver repo.
I'm not sure if the behavior that I will describe here is a bug or a feature, but I want to document what has been seen so that we have a record. I also don't know whether this Issue should be filed in the iomanager repo or the ConnectivityService repo, so I'm just picking one.

In the high-rate data taking for the offline data challenge in November 2023, sixteen data loggers were used. (The more accurate way to say that is '16 Dataflow Apps were used, and each had a single DataWriter module inside of it.') Each of the DataWriters was configured to use a different storage disk (4 disks on each of 4 servers).
In those runs, there were problems with the data writing on the DF app named "dataflow1". That was the TRB/DataWriter configured to write data to np04-srv-001:/data1. One observed aspect of the problem was that the DFO could not send TriggerDecisions to the TRB, as evidenced by error messages in the DFO log file. There may have been other warning or error messages in the logs, but the overall effect was that no data was written to disk on srv-001:/data1.
I believe that the observed failures were caused by problems with connection-name resolution when connection UIDs ending in "1" and in "10", "11", ..., "15" were present in the same DAQ system configuration.
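The suspected clash is easy to reproduce in isolation. Assuming re.search-style partial matching in the connectivity service (the actual service code may differ), a lookup of the connection ending in "1" also matches the connections ending in "10" through "15":

```python
import re

# a lookup for Dataflow App 1's connection...
pattern = re.compile("trigger_decision_1")

# ...matches its own UID, but also partially matches app 11's UID
assert pattern.search("trigger_decision_1")
assert pattern.search("trigger_decision_11")  # unintended partial match
print("both UIDs matched")
```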
Here is some console output that helps illustrate what happened.
Here are some examples of the observed DFO error messages:
Here is a list of the DFO log files in which the reported errors appeared:
Here is a list of the DFO log files that mention a Dataflow App with number 15:
Here is a list of the DFO log files that mention a Dataflow App with number 3:
The point of those lists of log files was to demonstrate that the log files that included 16 TRBs had the problem with DFApp number 1, but the log files that had only 4 TRBs did not.
Next, I'll post some sample instructions to help demonstrate the problem in emulated-data systems.