DUNE-DAQ / drunc

Dune RUN Control (DRUNC) is the run control for the DUNE experiment

root-controller fails to start properly #304

Open eflumerf opened 1 week ago

eflumerf commented 1 week ago
'root-controller' (29447e30-51a6-480d-9bc6-3e6882965e3e) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "df-controller"                                                                                                             
                    tree_id: "1.1.0"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted df-controller uid: ccb0fd88-957f-4e9f-a09e-da4ab9613591            
'df-controller' (ccb0fd88-957f-4e9f-a09e-da4ab9613591) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "dfo-01"                                                                                                                    
                    tree_id: "1.1.1"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted dfo-01 uid: e5356761-c163-49ea-a321-47833775d868                   
'dfo-01' (e5356761-c163-49ea-a321-47833775d868) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "df-01"                                                                                                                     
                    tree_id: "1.1.2"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted df-01 uid: 32b82612-79a7-4865-b96a-9a123b447236                    
'df-01' (32b82612-79a7-4865-b96a-9a123b447236) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "trg-controller"                                                                                                            
                    tree_id: "1.2.0"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted trg-controller uid: 648a6cfe-a119-44fe-91d0-c2054714ea4a           
'trg-controller' (648a6cfe-a119-44fe-91d0-c2054714ea4a) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "mlt"                                                                                                                       
                    tree_id: "1.2.1"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted mlt uid: db001912-7d62-4db8-be94-dc946e676576                      
'mlt' (db001912-7d62-4db8-be94-dc946e676576) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "ru-controller"                                                                                                             
                    tree_id: "1.3.0"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted ru-controller uid: a501ce27-8c96-4e72-9723-cd7664912bb3            
'ru-controller' (a501ce27-8c96-4e72-9723-cd7664912bb3) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "ru-det-conn-2"                                                                                                             
                    tree_id: "1.3.1"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted ru-det-conn-2 uid: a0d72f76-239d-48a1-b973-0aa7a861de1c            
'ru-det-conn-2' (a0d72f76-239d-48a1-b973-0aa7a861de1c) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "ru-det-conn-1"                                                                                                             
                    tree_id: "1.3.2"                                                                                                                  

[03:20:31] INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted ru-det-conn-1 uid: 1a62d5e6-1577-461b-aaee-f592b4699dc0            
'ru-det-conn-1' (1a62d5e6-1577-461b-aaee-f592b4699dc0) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "dunedaq"                                                   
                    session: "3ru1df"                                                                                                                 
                    name: "ru-det-conn-0"                                                                                                             
                    tree_id: "1.3.3"                                                                                                                  

           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted ru-det-conn-0 uid: 5b70e144-5aae-45b5-ae48-31102845a102            
'ru-det-conn-0' (5b70e144-5aae-45b5-ae48-31102845a102) process started
⠙ Looking for 'root-controller' on the connectivity service... ━━━━╸ 0:0… 0:01:…
[03:21:31] ERROR    process_manager_driver.py:213   process_manager_driver:                                                                           
                    Could not find 'root-controller' on the connectivity service.                                                                     

                    Two possibilities:                                                                                                                

                    1. The most likely, the controller died. You can check that by looking for error like:                                            
                    Process 'root-controller' (session: '3ru1df', user: 'dunedaq') process exited with exit code 1).                                  
                    Try running ps to see if the root-controller is still running.                                                                    
                    You may also want to check the logs of the controller, try typing:                                                                
                    logs --name root-controller --how-far 1000                                                                                        
                    If that's not helping, you can restart this shell with --log-level debug, and look out for 'STDOUT' and 'STDERR'.                 

                    2. The controller did not die, but is still setting up and has not advertised itself on the connection service.                   
                    You may be able to connect to the root-controller in a bit. Check the logs of the controller:                                     
                    logs --name root-controller --grep grpc                                                                                           
                    And look for messages like:                                                                                                       
                    Registering root-controller to the connectivity service at grpc://xxx.xxx.xxx.xxx:xxxxx                                           
                    To find the controller address, you can look up 'root-controller_control' on http://daq.fnal.gov:37667/ (you may need a SOCKS proxy
                    from outside CERN), or use the address from the logs as above. Then just connect this shell to the controller with:               
                    connect {controller_address}:{controller_port}>                                                                                   

[03:21:32] ERROR    shell_utils.py:295      unified:        Could not understand where the controller is!                                             
Running transition 'conf' on controller 'root-controller'
           ERROR    shell_utils.py:276      unified:        Controller-specific commands cannot be sent until the session is booted                   
---------- DRUNC Run END ----------
eflumerf commented 1 week ago

This issue has been seen in multiple contexts, including the automated regression tests: https://github.com/DUNE-DAQ/daq-release/actions/runs/11755085773/job/32749783614#step:2:137. It usually resolves itself, but it would be good to understand why the root-controller fails to start correctly.

plasorak commented 1 week ago

Just to make sure I understand: there have not been any changes in the configuration? Is the connectivity server being started by the integtest or by drunc?

plasorak commented 1 week ago

Unfortunately, if any application fails to start, the controller in charge of it cannot find it in the connectivity service, so it dies too. Finally, the root-controller looks up that controller on the connectivity service, the lookup fails, and so the root-controller also dies.

In the short term, I suggest running ps straight after boot in the integtests, to see which applications are not booting correctly. In the long term, I guess the question is what exactly should happen to the control tree when a DAQ app fails to start correctly. I think what we want here is for the controller to still start, but in an error state.
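To make the cascade concrete, here is a toy model of the control tree. It is only a sketch and is not drunc code: `Node`, `tree_status`, `build_tree` and the `tolerant` flag are hypothetical names, and the parent/child layout is inferred from the `tree_id` values in the boot log above. With `tolerant=False` it mimics the behaviour described here (one dead app takes its controller and the root-controller down with it). With `tolerant=True` it mimics the proposed behaviour, where controllers still start but report an error.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One process in the control tree (hypothetical model, not a drunc class)."""
    name: str
    started: bool = True  # False models a process that failed to boot
    children: list["Node"] = field(default_factory=list)


def tree_status(node: Node, tolerant: bool = False) -> dict[str, str]:
    """Map every node name in the subtree to 'ok', 'error' or 'dead'."""
    result: dict[str, str] = {}
    for child in node.children:
        result.update(tree_status(child, tolerant))
    if not node.started:
        result[node.name] = "dead"
    elif any(result[child.name] != "ok" for child in node.children):
        # Strict mode: a controller that cannot reach a child dies as well.
        # Tolerant mode: it stays up and flags the problem instead.
        result[node.name] = "error" if tolerant else "dead"
    else:
        result[node.name] = "ok"
    return result


def build_tree(failed: set[str]) -> Node:
    """Build the tree seen in the log; names in `failed` do not boot."""
    def app(name: str) -> Node:
        return Node(name, started=name not in failed)

    return Node("root-controller", children=[
        Node("df-controller", children=[app("dfo-01"), app("df-01")]),
        Node("trg-controller", children=[app("mlt")]),
        Node("ru-controller", children=[
            app("ru-det-conn-2"), app("ru-det-conn-1"), app("ru-det-conn-0"),
        ]),
    ])


if __name__ == "__main__":
    for mode in (False, True):
        status = tree_status(build_tree({"mlt"}), tolerant=mode)
        print("tolerant" if mode else "strict", status)
```

In strict mode, a single failed leaf is enough to report the root-controller as dead, which matches the symptom in this issue. In tolerant mode the rest of the tree stays reachable, so the failing application can be identified from the shell.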

PawelPlesniak commented 1 week ago

Additionally, if root-controller fails, the "Trying to talk to top controller" step is not interruptible. The fix should be similar to the one for interruptible boot.
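One possible shape for an interruptible wait, as a minimal sketch only: it does not reflect how drunc implements the lookup or the interruptible boot. `wait_for_controller`, `lookup`, `stop` and the default timings are hypothetical, and the 60 second default is simply an assumption based on the roughly one-minute gap between the last boot message and the error in the log above.

```python
import threading
import time
from typing import Callable, Optional


def wait_for_controller(
    lookup: Callable[[], Optional[str]],
    stop: threading.Event,
    timeout: float = 60.0,
    poll_interval: float = 0.5,
) -> Optional[str]:
    """Poll lookup() until it returns an address, the timeout expires,
    or `stop` is set. Returns the address, or None if none was found."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        address = lookup()
        if address is not None:
            return address
        # Event.wait returns True as soon as `stop` is set, so an interrupt
        # is noticed within one poll interval rather than after the timeout.
        if stop.wait(poll_interval):
            return None
    return None
```

A shell could set `stop` from a SIGINT handler, for example `signal.signal(signal.SIGINT, lambda signum, frame: stop.set())`, so that Ctrl-C ends the wait promptly instead of blocking until the timeout.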