DUNE-DAQ / fdreadoutlibs

The stop_trigger_sources transition can take multiple minutes in a simple demo system #216

Open bieryAtFnal opened 3 weeks ago

bieryAtFnal commented 3 weeks ago

I'm not sure whether this repo (fdreadoutlibs) is the right place to file this issue, but I want to document that the stop_trigger_sources transition often takes about 2 minutes to complete when using the simple system configuration currently available in the appmodel repo together with a recent nightly build of the software.

Here are sample steps to demonstrate the issue:

# Build a unique work-area name from the current date and time
DATE_PREFIX=$(date '+%d%b')
TIME_SUFFIX=$(date '+%H%M')

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest_v5
dbt-create -n NFD_DEV_240820_A9 ${DATE_PREFIX}20AugNightlyFDv5Test_${TIME_SUFFIX}
cd ${DATE_PREFIX}20AugNightlyFDv5Test_${TIME_SUFFIX}/sourcecode

git clone https://github.com/DUNE-DAQ/appmodel.git -b develop
cd appmodel ; git checkout 4e34d68adc4; cd ..
git clone https://github.com/DUNE-DAQ/fdreadoutlibs.git -b develop
cd fdreadoutlibs; git checkout ab0d71cc4f9; cd ..
git clone https://github.com/DUNE-DAQ/fdreadoutmodules.git -b develop
cd fdreadoutmodules; git checkout 105a58f82cc2; cd ..
cd ..

# Patch FrameExpand.hpp in the local checkout before building
sed -i 's/unpack_one_register(second_half)/unpack_one_register(first_half)/' sourcecode/fdreadoutlibs/include/fdreadoutlibs/wibeth/tpg/FrameExpand.hpp

dbt-workarea-env
dbt-build -j 12
dbt-workarea-env

mkdir rundir
cd rundir

# Execute the following commands by hand:

killall drunc-controller
drunc-unified-shell ssh-standalone

# within drunc

boot test/config/test-session.data.xml test-session
fsm conf
fsm start run_number 101
fsm enable_triggers
# wait for a few seconds
fsm disable_triggers
fsm drain_dataflow

# note how long the next step takes
fsm stop_trigger_sources

fsm stop
fsm scrap
exit

I believe this long delay was not present in nightly builds as recent as 17-Aug.
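To put a number on the "note how long the next step takes" step, one option is to compare the `[HH:MM:SS]` timestamps that drunc prints at the start of some of its log lines. The following is only a rough Python sketch under that assumption; the helper name `elapsed_between_first_and_last` and the two sample lines are made up for illustration and are not real drunc output.

```python
import re
from datetime import datetime, timedelta

# Assumption: timestamped lines start with "[HH:MM:SS]" in column 0.
TIMESTAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def elapsed_between_first_and_last(lines):
    """Return the time between the first and last timestamped lines."""
    stamps = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    if len(stamps) < 2:
        raise ValueError("need at least two timestamped lines")
    delta = stamps[-1] - stamps[0]
    if delta < timedelta(0):
        # The timestamps carry no date, so assume the run crossed midnight once.
        delta += timedelta(days=1)
    return delta


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input shape.
    sample = [
        "[13:16:50] INFO     sample line before the transition",
        "           INFO     continuation line without a timestamp",
        "[13:18:53] ERROR    sample line after the transition",
    ]
    print(elapsed_between_first_and_last(sample).total_seconds(), "seconds")
```

Pasting the console output captured around `fsm stop_trigger_sources` into a list of lines (or reading it from a saved file) would give a repeatable duration to compare between nightly builds.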

bieryAtFnal commented 3 weeks ago

I just noticed the following output from drunc:

Sending stop.
[13:18:53] ERROR    "controller_driver": Command 'execute_fsm_command' failed on 'ru-01' (response flag 'UNHANDLED_EXCEPTION_THROWN')    shell_utils.py:138
           ERROR    "controller_driver": Exception thrown from child: Stacktrace [bold red]on remote server![/bold red]    shell_utils.py:123
                    Traceback (most recent call last):
                      File "/home/nfs/biery/dunedaq/20Aug20AugNightlyFDv5Test_1311/.venv/lib/python3.10/site-packages/drunc/controller/controller.py", line 290, in propagate_to_child
                        response = child.propagate_command(command, command_data, token)
                      File "/home/nfs/biery/dunedaq/20Aug20AugNightlyFDv5Test_1311/.venv/lib/python3.10/site-packages/drunc/controller/children_interface/rest_api_child.py", line 522, in propagate_command
                        exit_state = exit_state.upper(),
                    AttributeError: 'NoneType' object has no attribute 'upper'

                    AttributeError: 'NoneType' object has no attribute 'upper'
                     ru-01 -> 6

                    stop execution report

This confirms Giovanna's sense that the problem is in the Readout App...
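
The `AttributeError` in the traceback suggests that `exit_state` was `None` when `propagate_command` tried to upper-case it, presumably because the response from `ru-01` carried no exit state. The Python sketch below only illustrates that failure mode and one possible guard. The function names, the response dictionary, and the `"UNKNOWN"` placeholder are hypothetical and are not taken from drunc.

```python
from typing import Optional


def normalize_exit_state_unsafe(response: dict) -> str:
    # Same pattern as the failing line: .get() returns None when the key
    # is missing, and None has no .upper() method.
    exit_state = response.get("exit_state")
    return exit_state.upper()


def normalize_exit_state(response: dict, fallback: str = "UNKNOWN") -> str:
    # Hypothetical guard: return a placeholder when no exit state came back,
    # so the caller can report the real problem instead of crashing.
    exit_state: Optional[str] = response.get("exit_state")
    if exit_state is None:
        return fallback
    return exit_state.upper()


if __name__ == "__main__":
    # Made-up response with no "exit_state" key, as if the child never reported one.
    incomplete = {"command": "stop_trigger_sources"}
    try:
        normalize_exit_state_unsafe(incomplete)
    except AttributeError as err:
        print(f"unsafe version failed: {err}")
    print(normalize_exit_state(incomplete))
    # "some_state" is a placeholder value, not a real FSM state name.
    print(normalize_exit_state({"exit_state": "some_state"}))
```

A guard like this would only hide the symptom in the controller. The underlying question of why the Readout App does not return an exit state in time would still need to be answered.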