Closed bieryAtFnal closed 8 months ago
Here are sample instructions for testing these changes:
source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -n NFD_PROD4_240318_A9 18MarFDDevTardyTPChangeTesting
cd 18MarFDDevTardyTPChangeTesting/sourcecode
git clone https://github.com/DUNE-DAQ/daqconf.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b kbiery/only_for_tardy_tp_testing
git clone https://github.com/DUNE-DAQ/dfmodules.git -b production/v4
git clone https://github.com/DUNE-DAQ/fddaqconf.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/fdreadoutlibs.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/fdreadoutmodules.git -b production/v4
git clone https://github.com/DUNE-DAQ/readoutlibs.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/readoutmodules.git -b production/v4
cd ..
dbt-workarea-env
dbt-build -j 20
dbt-workarea-env
daqsystemtest_integtest_bundle.sh -s 2 -f 2 -l 2 -N 3 --stop-on-fail
The last step runs the 3ru_3df_multirun_test.py integtest multiple times and stops if/when there is a failure. These instructions include a modified version of the 3ru_3df_multirun_test.py integtest that is designed to demonstrate the presence of Tardy TPs. (It sets the tpset_min_latency_ticks config parameter to a lower-than-usual value.)
If the first run of the daqsystemtest_integtest_bundle.sh script does not demonstrate the problem, please try running it again.
A failure will show console output like the following:
Problem(s) found in logfile /tmp/pytest-of-biery/pytest-3535/run1/log_rulocalhosteth1_4337.txt:
2024-Mar-18 13:31:00,273 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 1011, 0 ms beyond the allowed latency (20 ms, 1250000 ticks).
2024-Mar-18 13:31:00,274 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 1000, 0.49152 ms beyond the allowed latency (20 ms, 1250000 ticks).
...
2024-Mar-18 13:31:18,724 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 193, 179.896 ms beyond the allowed latency (20 ms, 1250000 ticks). -- 100000 similar messages suppressed, last occurrence was at 2024-Mar-18 13:31:18,723630
FSanity-check passed
All Attribute tests passed for file swtest_run000101_0000_dataflow0_datawriter_0_20240318T183059.hdf5
Event count 104 is outside the tolerance of 6.0 from an expected value of 120.0
F
=================================== FAILURES ===================================
_______________ test_log_files[Software_TPG_System-run_nanorc0] ________________
run_nanorc = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>
def test_log_files(run_nanorc):
if check_for_logfile_errors:
# Check that there are no warnings or errors in the log files
> assert log_file_checks.logs_are_error_free(run_nanorc.log_files, True, True, ignored_logfile_problems)
E AssertionError: assert False
E + where False = <function logs_are_error_free at 0x7eff8b7c5b40>([PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_connectionservice_16043.txt'), PosixPath('/tmp/pytest-of-biery/p...iery/pytest-3535/run1/log_dfo_4335.txt'), PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_fakehsi_4333.txt'), ...], True, True, {})
E + where <function logs_are_error_free at 0x7eff8b7c5b40> = log_file_checks.logs_are_error_free
E + and [PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_connectionservice_16043.txt'), PosixPath('/tmp/pytest-of-biery/p...iery/pytest-3535/run1/log_dfo_4335.txt'), PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_fakehsi_4333.txt'), ...] = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>.log_files
sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py:138: AssertionError
_______________ test_data_files[Software_TPG_System-run_nanorc0] _______________
run_nanorc = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>
def test_data_files(run_nanorc):
local_expected_event_count=expected_event_count
local_event_count_tolerance=expected_event_count_tolerance
low_number_of_files=expected_number_of_data_files
high_number_of_files=expected_number_of_data_files
fragment_check_list=[triggercandidate_frag_params, hsi_frag_params]
if "enable_tpg" in run_nanorc.confgen_config["readout"].keys() and run_nanorc.confgen_config["readout"]["enable_tpg"]:
local_expected_event_count+=(250*number_of_data_producers*number_of_readout_apps*run_duration/(100*number_of_dataflow_apps))
local_event_count_tolerance+=(10*number_of_data_producers*number_of_readout_apps*run_duration/(100*number_of_dataflow_apps))
#fragment_check_list.append(wib2_frag_multi_trig_params) # DuneWIB
fragment_check_list.append(wibeth_frag_multi_trig_params) # WIBEth
fragment_check_list.append(triggertp_frag_params)
fragment_check_list.append(triggeractivity_frag_params)
else:
low_number_of_files-=number_of_dataflow_apps
if low_number_of_files < 1:
low_number_of_files=1
#fragment_check_list.append(wib2_frag_hsi_trig_params) # DuneWIB
fragment_check_list.append(wibeth_frag_hsi_trig_params) # WIBEth
# Run some tests on the output data file
assert len(run_nanorc.data_files)==high_number_of_files or len(run_nanorc.data_files)==low_number_of_files
for idx in range(len(run_nanorc.data_files)):
data_file=data_file_checks.DataFile(run_nanorc.data_files[idx])
assert data_file_checks.sanity_check(data_file)
assert data_file_checks.check_file_attributes(data_file)
> assert data_file_checks.check_event_count(data_file, local_expected_event_count, local_event_count_tolerance)
E assert False
E + where False = <function check_event_count at 0x7eff8b7c57e0>(<integrationtest.data_file_checks.DataFile object at 0x7eff86393e80>, 120.0, 6.0)
E + where <function check_event_count at 0x7eff8b7c57e0> = data_file_checks.check_event_count
sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py:167: AssertionError
=========================== short test summary info ============================
FAILED sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py::test_log_files[Software_TPG_System-run_nanorc0]
FAILED sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py::test_data_files[Software_TPG_System-run_nanorc0]
=================== 2 failed, 4 passed in 188.72s (0:03:08) ====================
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
Mon Mar 18 01:32:21 PM CDT 2024
Log file is: /tmp/pytest-of-biery/daqsystemtest_integtest_bundle_20240318132549.log
===== Running 3ru_3df_multirun_test.py
======================== 6 passed in 202.17s (0:03:22) =========================
===== Running 3ru_3df_multirun_test.py
=================== 2 failed, 4 passed in 188.72s (0:03:08) ====================
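For anyone reading the test_data_files failure above, the "Event count 104 is outside the tolerance of 6.0 from an expected value of 120.0" message comes down to a tolerance comparison. Here is a minimal sketch of that comparison, assuming the check passes when the observed count is within plus or minus the tolerance of the expected value. `within_tolerance` is a hypothetical helper for illustration, not the actual `data_file_checks.check_event_count` code.

```python
def within_tolerance(observed: float, expected: float, tolerance: float) -> bool:
    """Return True when observed lies within +/- tolerance of expected.

    Assumption: a symmetric, inclusive window around the expected value.
    """
    return abs(observed - expected) <= tolerance


# Values taken from the failure output above: 104 events seen,
# 120.0 expected, tolerance 6.0. 104 is 16 below 120, so the check fails.
print(within_tolerance(104, 120.0, 6.0))
```

Under that assumption, any count from 114 to 126 would have passed for this file.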
If desired, the 3ru_3df integtest can be locally modified to change the value of one or both of the relevant configuration parameters:
Longer values for tpset_min_latency_ticks will reduce the probability of demonstrating the problem.
Longer values for tardy_tp_quiet_time_at_start_sec will reduce the number of warning messages that are produced when the problem occurs.
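When picking a value for tpset_min_latency_ticks, it can help to convert between ticks and milliseconds. The sketch below infers the conversion factor from the sample warning message, which reports the allowed latency as "20 ms, 1250000 ticks". It assumes the same tick rate applies to the configuration parameter. The helper names are hypothetical and not part of any DUNE-DAQ package.

```python
# Conversion factor inferred from the sample warning message:
# 1250000 ticks correspond to 20 ms, i.e. 62500 ticks per millisecond.
# This is an inference from the log text, not a documented constant.
TICKS_PER_MS = 1250000 / 20


def ticks_to_ms(ticks: int) -> float:
    """Convert a tick count to milliseconds using the inferred rate."""
    return ticks / TICKS_PER_MS


def ms_to_ticks(ms: float) -> int:
    """Convert milliseconds to the nearest whole number of ticks."""
    return round(ms * TICKS_PER_MS)


print(ticks_to_ms(1250000))  # the allowed latency from the sample message, in ms
print(ms_to_ticks(0.49152))  # the lateness from the second sample warning, in ticks
```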
So, in trying to test this at NP04, I had to pull down the production/v4 branch of rcif in order to build properly. Then, in running, readout affinity wasn't applied?
Executing ['readout-affinity.py --pinfile ${DUNEDAQ_THREAD_PIN_FILE}'] script on 'np04-srv-001':
export DUNEDAQ_THREAD_PIN_FILE="/nfs/sw/dunedaq/18MarFDDevTardyTPChangeTesting/np04daq-configs/cpupin_files/cpupin-all-running.json"; export
LD_LIBRARY_PATH="/cvmfs/dunedaq.opensciencegrid.org/spack/externals...rvbv7wiq57sgvs5/lib:/home/fwComponents_np04rc/bin:"; export
PATH="/cvmfs/dunedaq.opensciencegrid.org/spack/externals.../nfs/home/np04daq/.local/bin:/nfs/home/np04daq/bin"; readout-affinity.py --pinfile
${DUNEDAQ_THREAD_PIN_FILE}
ERROR Traceback (most recent call last): sshpm.py:180
File "/nfs/sw/dunedaq/18MarFDDevTardyTPChangeTesting/install/readoutlibs/bin/readout-affinity.py", line 6, in
<module>
import psutil
ModuleNotFoundError: No module named 'psutil'
but I don't see how/why this change would affect that at all (also, import psutil from python works locally, so not sure).
In any case though, I can verify that the tp tardy messages don't appear until after 10 seconds from the start of the run, and that the message is the desired format. Given that I don't think the above should have anything to do with this and the other tests work, I'm marking this approved.
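For repeating that timing check on other log files, here is a rough sketch of a scan that pulls the late-TP warnings out of a log and reports how long after run start the first one appeared. The regular expression is based only on the sample messages quoted in this thread. The function names are hypothetical, and the run start time has to be supplied by the caller since it is not part of the warning text. The month abbreviation is parsed with `%b`, which assumes an English/C locale.

```python
import re
from datetime import datetime

# Pattern modelled on the sample warnings in this thread (an assumption
# about the log format, not a specification of it).
LATE_TP_RE = re.compile(
    r"^(?P<ts>\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARNING .*"
    r"Received late data packet, TP channel (?P<channel>\d+), "
    r"(?P<late_ms>[\d.]+) ms beyond the allowed latency"
)


def parse_late_tp_warnings(lines):
    """Return (timestamp, channel, ms_late) tuples for matching log lines."""
    results = []
    for line in lines:
        match = LATE_TP_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%b-%d %H:%M:%S,%f")
            results.append(
                (ts, int(match.group("channel")), float(match.group("late_ms")))
            )
    return results


def first_warning_delay_sec(lines, run_start):
    """Seconds between run_start and the earliest late-TP warning, or None."""
    warnings = parse_late_tp_warnings(lines)
    if not warnings:
        return None
    earliest = min(ts for ts, _, _ in warnings)
    return (earliest - run_start).total_seconds()
```

A returned delay that is at least the configured quiet time (10 seconds by default) would be consistent with the suppression working as intended.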
Based on various discussions that we have had since the Tardy-TP warnings were first introduced, this PR (along with 3 in other repos) improves the content of the warning message and provides the ability to suppress the warning message at the start of a run, based on a configurable time interval. The suppression of the warning message does not affect the counting of any Tardy TPs. Those will still appear in the metrics that are reported.
Here is a sample of the updated warning message:
The default value for the quiet time at the start of the run is 10 seconds. It can be changed using the tardy_tp_quiet_time_at_start_sec parameter in the readout section of the daqconf configuration inputs.
The changes in other repos that should be included with this one are DUNE-DAQ/readoutlibs#160, DUNE-DAQ/daqconf#434, and DUNE-DAQ/fddaqconf#28.
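To illustrate the intended behavior (warnings suppressed during the quiet time, Tardy TPs still counted), here is a minimal Python sketch. It is not the C++ code in fdreadoutlibs. The class, its attributes, and the shortened message text are all hypothetical, and times are plain seconds to keep the example self-contained.

```python
class TardyTPReporter:
    """Hypothetical model of quiet-time warning suppression."""

    def __init__(self, run_start_sec: float, quiet_time_at_start_sec: float = 10.0):
        self.run_start_sec = run_start_sec
        self.quiet_time_at_start_sec = quiet_time_at_start_sec
        self.tardy_count = 0  # always incremented, feeds the metrics
        self.warnings = []    # only filled once the quiet time has passed

    def report_tardy(self, now_sec: float, channel: int, ms_late: float) -> None:
        # Counting is independent of whether the warning is suppressed.
        self.tardy_count += 1
        if now_sec - self.run_start_sec >= self.quiet_time_at_start_sec:
            self.warnings.append(
                f"Received late data packet, TP channel {channel}, "
                f"{ms_late} ms beyond the allowed latency"
            )


reporter = TardyTPReporter(run_start_sec=0.0)
reporter.report_tardy(now_sec=2.0, channel=1011, ms_late=0.0)      # inside quiet time
reporter.report_tardy(now_sec=12.0, channel=193, ms_late=179.896)  # after quiet time
print(reporter.tardy_count, len(reporter.warnings))  # both counted, one warning emitted
```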