DUNE-DAQ / drunc

Dune RUN Control (DRUNC) is the run control for the DUNE experiment

Unable to run interactive drunc session with example config from np04daq account #269

Closed bieryAtFnal closed 1 month ago

bieryAtFnal commented 1 month ago

When I try to do this, for example on np04-srv-003, I see errors like ApplicationLookupUnsuccessful: Could not resolve the URI for 'root-controller_control' in the connectivity service, got response []

Here are instructions for reproducing the tests that I ran...

# as user np04daq...

DATE_PREFIX=`date '+%d%b'`
TIME_SUFFIX=`date '+%H%M'`

source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh
setup_dbt latest_v5
dbt-create -n NFD_DEV_241016_A9 ${DATE_PREFIX}FDDev_${TIME_SUFFIX}
cd ${DATE_PREFIX}FDDev_${TIME_SUFFIX}/sourcecode

git clone https://github.com/DUNE-DAQ/daqsystemtest.git -b plasorak/no-thread-pinning
cd ..

dbt-workarea-env
dbt-build -j 20
dbt-workarea-env

mkdir rundir
cd rundir

source ~/bin/web_proxy.sh -u

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config boot wait 5 conf wait 3 start 101 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config boot wait 5 conf wait 3 start 102 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate

plasorak commented 1 month ago

Starting drunc with the log level set to debug yields error messages like:

                      STDOUT:
                    bash: line 1: /nfs/home/np04daq/NFD_DEV_241016_A9_plasorak/log_np04daq_local-1x1-config_hsi-01.log: cannot overwrite existing file

                      STDERR:
                    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
                    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
                    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
                    It is also possible that a host key has just been changed.
                    The fingerprint for the ED25519 key sent by the remote host is
                    SHA256:MfI8N1nKxOfCgh1s2XcPpSvKblPN7V+WB5NbrWi8Afk.
                    Please contact your system administrator.
                    Add correct host key in /nfs/home/np04daq/.ssh/known_hosts to get rid of this message.
                    Offending ED25519 key in /nfs/home/np04daq/.ssh/known_hosts:133

I'm afraid this isn't a problem with the run control.

bieryAtFnal commented 1 month ago

I haven't been able to observe the HOST IDENTIFICATION HAS CHANGED message, but I have seen the following:

                      STDOUT:
                    bash: line 1: /nfs/home/np04daq/.biery/dunedaq/16OctFDDev_2222/rundir/log_np04daq_local-1x1-config_df-01.log: cannot overwrite existing file

                      STDERR:
                    Address ::1 maps to localhost, but this does not map back to the address.
                    Address ::1 maps to localhost, but this does not map back to the address.
                    Connection to localhost closed.

The presence of the "cannot overwrite existing file" message in both of our screen captures got me wondering whether removing existing log files from the current working directory would help. It did!

I can get reliable operation if I delete the log files between runs of drunc-unified-shell.
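
The cleanup between runs can be sketched roughly as below. The log_*.log pattern is an assumption based on the file names in the error messages above, and clean_stale_logs is a hypothetical helper, not something drunc provides. The demo works in a temporary directory; in practice the function would be pointed at rundir before each drunc-unified-shell invocation.

```shell
# Hypothetical helper: remove per-application log files left by a previous run.
# $1 = directory to clean (defaults to the current directory).
clean_stale_logs() {
    target_dir=${1:-.}
    # Pattern assumed from names such as log_np04daq_local-1x1-config_df-01.log
    rm -f -- "$target_dir"/log_*.log
}

# Demo in a scratch directory so nothing real is deleted.
demo_dir=$(mktemp -d)
touch "$demo_dir/log_np04daq_local-1x1-config_df-01.log" \
      "$demo_dir/log_np04daq_local-1x1-config_hsi-01.log" \
      "$demo_dir/keep-me.txt"

clean_stale_logs "$demo_dir"

# Only the non-log file should be left.
remaining=$(ls "$demo_dir")
echo "remaining: $remaining"
```

Files that do not match the pattern are left alone, so the helper should be safe to run in a directory that also holds configuration or data files.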

It might be interesting to temporarily remove the "set -o noclobber" line from the np04daq account .bashrc to see if that helps, but I'm not willing to try that without coordinating with other people.
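
For reference, the effect of noclobber can be reproduced in isolation. This is a minimal sketch run in a scratch directory, not the command drunc actually uses to launch processes; the file name log_demo.log is made up for illustration. With the option set, a plain `>` redirection onto an existing file is refused (bash reports this as "cannot overwrite existing file"), while `>|` forces the overwrite for that one redirection.

```shell
# Scratch directory so the demo never touches real log files.
workdir=$(mktemp -d)
logfile="$workdir/log_demo.log"   # hypothetical file name

set -o noclobber                  # the option set in the np04daq .bashrc

# First write: the file does not exist yet, so '>' succeeds.
echo "run 1" > "$logfile"
first_status=$?

# Second write to the same path: the shell refuses to truncate the file.
# The error text is captured instead of being printed to the terminal.
second_error=$( { echo "run 2" > "$logfile"; } 2>&1 )
second_status=$?

# '>|' explicitly overrides noclobber for this one redirection.
echo "run 3" >| "$logfile"
third_status=$?

set +o noclobber
echo "first=$first_status second=$second_status third=$third_status"
echo "contents: $(cat "$logfile")"
```

The second write fails with a non-zero status and leaves the file untouched, which matches the symptom seen when a second session tries to reuse the same log file names.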

I tried running an fddaq-v4.4.8 system from the np04daq account on np04-srv-003, and I did not have a problem with a second set of log files overwriting the first. Not sure what nanorc would have been doing differently...

plasorak commented 1 month ago

A way to fix the issue is to add --no-override-logs to the boot command. That will generate logs that are timestamped.
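
To illustrate why timestamped names avoid the collision, here is a small sketch. The name format and the make_log_name helper are hypothetical; the exact format drunc uses with --no-override-logs is not shown in this thread. The timestamps are fixed by hand so the demo is repeatable.

```shell
# Hypothetical helper: build a log file name that embeds a timestamp.
# $1 = user, $2 = session, $3 = application, $4 = optional timestamp override
make_log_name() {
    stamp=${4:-$(date '+%Y%m%dT%H%M%S')}
    printf 'log_%s_%s_%s_%s.log\n' "$1" "$2" "$3" "$stamp"
}

demo_dir=$(mktemp -d)
set -o noclobber                  # same shell option as in the failing setup

# Two "runs" of the same application started at different times.
first_log="$demo_dir/$(make_log_name np04daq local-1x1-config df-01 20241016T222200)"
second_log="$demo_dir/$(make_log_name np04daq local-1x1-config df-01 20241016T222500)"

# Each run writes to its own file, so noclobber never has anything to refuse.
echo "run 1" > "$first_log";  first_status=$?
echo "run 2" > "$second_log"; second_status=$?

set +o noclobber
log_count=$(ls "$demo_dir" | wc -l | tr -d ' ')
echo "first=$first_status second=$second_status files=$log_count"
```

Because each run gets a distinct file name, both writes succeed even with noclobber enabled, at the cost of log files accumulating in the run directory.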

plasorak commented 1 month ago

> I tried running an fddaq-v4.4.8 system from the np04daq account on np04-srv-003, and I did not have a problem with a second set of log files overwriting the first. Not sure what nanorc would have been doing differently...

It depends on whether you are using nanorc or nano04rc. I would expect to see the same behaviour with the former, but with the latter, logs go to /logs and get a timestamp (which is another thing we need to fix in drunc).

bieryAtFnal commented 1 month ago

Closing this issue. Adding the suggested option to the boot command worked great.

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config boot --no-override-logs wait 5 conf wait 3 start 104 enable-triggers wait 10 disable-triggers drain-dataflow stop-trigger-sources stop scrap terminate