DUNE-DAQ / fdreadoutlibs

fdreadoutlibs
0 stars 3 forks source link

Improved the Tardy-TP warning message and added suppression of it at start-run #166

Closed bieryAtFnal closed 8 months ago

bieryAtFnal commented 8 months ago

Based on various discussions that we have had since the Tardy-TP warnings were first introduced, this PR (along with 3 in other repos) improves the content of the warning message and provides the ability to suppress the warning message at the start of a run, based on a configurable time interval. The suppression of the warning message does not affect the counting of any Tardy TPs. Those will still appear in the metrics that are reported.

Here is a sample of the updated warning message:

2024-Mar-18 13:00:02,866 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPImprovements/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 1856, 0.16384 ms beyond the allowed latency (20 ms, 1250000 ticks).

The default value for the quiet time at the start of the run is 10 seconds. It can be changed using the tardy_tp_quiet_time_at_start_sec parameter in the readout section of the daqconf configuration inputs.

The changes in other repos that should be included with this one are DUNE-DAQ/readoutlibs#160, DUNE-DAQ/daqconf#434, and DUNE-DAQ/fddaqconf#28.

bieryAtFnal commented 8 months ago

Here are sample instructions for testing these changes:

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest
dbt-create -n NFD_PROD4_240318_A9 18MarFDDevTardyTPChangeTesting
cd 18MarFDDevTardyTPChangeTesting/sourcecode
git clone https://github.com/DUNE-DAQ/daqconf.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b kbiery/only_for_tardy_tp_testing
git clone https://github.com/DUNE-DAQ/dfmodules.git -b production/v4
git clone https://github.com/DUNE-DAQ/fddaqconf.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/fdreadoutlibs.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/fdreadoutmodules.git -b production/v4
git clone https://github.com/DUNE-DAQ/readoutlibs.git -b kbiery/improving_latetp_warning_message
git clone https://github.com/DUNE-DAQ/readoutmodules.git -b production/v4
cd ..

dbt-workarea-env
dbt-build -j 20
dbt-workarea-env

daqsystemtest_integtest_bundle.sh -s 2 -f 2 -l 2 -N 3 --stop-on-fail

The last step runs the 3ru_3df_multirun_test.py integtest multiple times and stops if/when there is a failure. These instructions include a modified version of the 3ru_3df_multirun_test.py integtest that is designed to demonstrate the presence of Tardy TPs. (it sets the tpset_min_latency_ticks config parameter to a lower-than-usual value.)

If the first running of the daqsystemtest_integtest_bundle.py script does not demonstrate the problem, please try running it again.

A failure will show console output like the following:

Problem(s) found in logfile /tmp/pytest-of-biery/pytest-3535/run1/log_rulocalhosteth1_4337.txt:
2024-Mar-18 13:31:00,273 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 1011, 0 ms beyond the allowed latency (20 ms, 1250000 ticks).

2024-Mar-18 13:31:00,274 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 1000, 0.49152 ms beyond the allowed latency (20 ms, 1250000 ticks).

...

2024-Mar-18 13:31:18,724 WARNING [virtual void dunedaq::fdreadoutlibs::TPCTPRequestHandler::report_tardy_packet(const dunedaq::fdreadoutlibs::types::TriggerPrimitiveTypeAdapter&, int64_t) at /home/nfs/biery/dunedaq/18MarFDDevTardyTPChangeTesting/sourcecode/fdreadoutlibs/src/TPCTPRequestHandler.cpp:91] SourceID[Trigger,1] Received late data packet, TP channel 193, 179.896 ms beyond the allowed latency (20 ms, 1250000 ticks). -- 100000 similar messages suppressed, last occurrence was at 2024-Mar-18 13:31:18,723630

FSanity-check passed
All Attribute tests passed for file swtest_run000101_0000_dataflow0_datawriter_0_20240318T183059.hdf5
Event count 104 is outside the tolerance of 6.0 from an expected value of 120.0
F

=================================== FAILURES ===================================
_______________ test_log_files[Software_TPG_System-run_nanorc0] ________________

run_nanorc = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>

    def test_log_files(run_nanorc):
        if check_for_logfile_errors:
            # Check that there are no warnings or errors in the log files
>           assert log_file_checks.logs_are_error_free(run_nanorc.log_files, True, True, ignored_logfile_problems)
E           AssertionError: assert False
E            +  where False = <function logs_are_error_free at 0x7eff8b7c5b40>([PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_connectionservice_16043.txt'), PosixPath('/tmp/pytest-of-biery/p...iery/pytest-3535/run1/log_dfo_4335.txt'), PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_fakehsi_4333.txt'), ...], True, True, {})
E            +    where <function logs_are_error_free at 0x7eff8b7c5b40> = log_file_checks.logs_are_error_free
E            +    and   [PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_connectionservice_16043.txt'), PosixPath('/tmp/pytest-of-biery/p...iery/pytest-3535/run1/log_dfo_4335.txt'), PosixPath('/tmp/pytest-of-biery/pytest-3535/run1/log_fakehsi_4333.txt'), ...] = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>.log_files

sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py:138: AssertionError
_______________ test_data_files[Software_TPG_System-run_nanorc0] _______________

run_nanorc = <integrationtest.integrationtest_nanorc.run_nanorc.<locals>.RunResult object at 0x7eff8676faf0>

    def test_data_files(run_nanorc):
        local_expected_event_count=expected_event_count
        local_event_count_tolerance=expected_event_count_tolerance
        low_number_of_files=expected_number_of_data_files
        high_number_of_files=expected_number_of_data_files
        fragment_check_list=[triggercandidate_frag_params, hsi_frag_params]
        if "enable_tpg" in run_nanorc.confgen_config["readout"].keys() and run_nanorc.confgen_config["readout"]["enable_tpg"]:
            local_expected_event_count+=(250*number_of_data_producers*number_of_readout_apps*run_duration/(100*number_of_dataflow_apps))
            local_event_count_tolerance+=(10*number_of_data_producers*number_of_readout_apps*run_duration/(100*number_of_dataflow_apps))
            #fragment_check_list.append(wib2_frag_multi_trig_params) # DuneWIB
            fragment_check_list.append(wibeth_frag_multi_trig_params) # WIBEth
            fragment_check_list.append(triggertp_frag_params)
            fragment_check_list.append(triggeractivity_frag_params)
        else:
            low_number_of_files-=number_of_dataflow_apps
            if low_number_of_files < 1:
                low_number_of_files=1
            #fragment_check_list.append(wib2_frag_hsi_trig_params) # DuneWIB
            fragment_check_list.append(wibeth_frag_hsi_trig_params) # WIBEth

        # Run some tests on the output data file
        assert len(run_nanorc.data_files)==high_number_of_files or len(run_nanorc.data_files)==low_number_of_files

        for idx in range(len(run_nanorc.data_files)):
            data_file=data_file_checks.DataFile(run_nanorc.data_files[idx])
            assert data_file_checks.sanity_check(data_file)
            assert data_file_checks.check_file_attributes(data_file)
>           assert data_file_checks.check_event_count(data_file, local_expected_event_count, local_event_count_tolerance)
E           assert False
E            +  where False = <function check_event_count at 0x7eff8b7c57e0>(<integrationtest.data_file_checks.DataFile object at 0x7eff86393e80>, 120.0, 6.0)
E            +    where <function check_event_count at 0x7eff8b7c57e0> = data_file_checks.check_event_count

sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py:167: AssertionError
=========================== short test summary info ============================
FAILED sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py::test_log_files[Software_TPG_System-run_nanorc0]
FAILED sourcecode/daqsystemtest/integtest/3ru_3df_multirun_test.py::test_data_files[Software_TPG_System-run_nanorc0]
=================== 2 failed, 4 passed in 188.72s (0:03:08) ====================

+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Mon Mar 18 01:32:21 PM CDT 2024
Log file is: /tmp/pytest-of-biery/daqsystemtest_integtest_bundle_20240318132549.log

===== Running 3ru_3df_multirun_test.py
======================== 6 passed in 202.17s (0:03:22) =========================
===== Running 3ru_3df_multirun_test.py
=================== 2 failed, 4 passed in 188.72s (0:03:08) ====================

If desired, the 3ru_3df integtest can be locally modified to change the value of one or both of the relevant configuration parameters:

  1. swtpg_conf["readout"]["tpset_min_latency_ticks"] = 1250000
  2. swtpg_conf["readout"]["tardy_tp_quiet_time_at_start_sec"] = 0

Longer values for the tpset_min_latency_ticks will reduce the probability of demonstrating the problem. Longer values for tardy_tp_quiet_time_at_start_sec will reduce the number of warning messages that are produced when the problem occurs.

wesketchum commented 8 months ago

So, in trying to test this at NP04, I had to pull down the production/v4 branch of rcif in order to build properly. Then, in running, readout affinity wasn't applied?

Executing ['readout-affinity.py --pinfile ${DUNEDAQ_THREAD_PIN_FILE}'] script on 'np04-srv-001':
export DUNEDAQ_THREAD_PIN_FILE="/nfs/sw/dunedaq/18MarFDDevTardyTPChangeTesting/np04daq-configs/cpupin_files/cpupin-all-running.json"; export
LD_LIBRARY_PATH="/cvmfs/dunedaq.opensciencegrid.org/spack/externals...rvbv7wiq57sgvs5/lib:/home/fwComponents_np04rc/bin:"; export
PATH="/cvmfs/dunedaq.opensciencegrid.org/spack/externals.../nfs/home/np04daq/.local/bin:/nfs/home/np04daq/bin"; readout-affinity.py --pinfile
${DUNEDAQ_THREAD_PIN_FILE}
           ERROR    Traceback (most recent call last):                                                                                      sshpm.py:180
                      File "/nfs/sw/dunedaq/18MarFDDevTardyTPChangeTesting/install/readoutlibs/bin/readout-affinity.py", line 6, in
                    <module>
                        import psutil
                    ModuleNotFoundError: No module named 'psutil'

but I don't see how/why this change would affect that at all (also, import psutil from python works locally, so not sure).

In any case though, I can verify that the tp tardy messages don't appear until after 10 seconds from the start of the run, and that the message is the desired format. Given that I don't think the above should have anything to do with this and the other tests work, I'm marking this approved.