Apps persisting after being declared dead

eflumerf commented 1 year ago

I've been trying to run the integration tests on our snazzy new group machine at Fermilab, and I've noticed failures which appear to be due to the app starting before NanoRC thinks that it should be...

[11:08:00] INFO     ResponseListener: Flask joined                                                                                                                              appctrl.py:107
           INFO     listrev-app-gr process exited with exit code 255                                                                                                              sshpm.py:142
           INFO     listrev-app-v process exited with exit code 255                                                                                                               sshpm.py:142
                  Run #101 finished

...

[11:08:06] INFO     Using filelogbook                                                                                                                                               core.py:93
Running on the apparatus json2:
╭───────────╮
│ json2     │
│ └── json2 │
╰───────────╯
           INFO     'json2' received command 'boot'                                                                                                                        statefulnode.py:230
           INFO     Propagating to children nodes in the order: ['json2']                                                                                                  statefulnode.py:238
           INFO     Sending boot to json2                                                                                                                                  statefulnode.py:246
           INFO     Subsystem json2 is booting partition integtest-partition                                                                                                       node.py:206
'listrev-app-g' logs are in 'daq.fnal.gov:/tmp/pytest-of-eflumerf/pytest-16/run2/log_listrev-app-g_3334.txt'
'listrev-app-rv' logs are in 'daq.fnal.gov:/tmp/pytest-of-eflumerf/pytest-16/run2/log_listrev-app-rv_3333.txt'
           ERROR    ERROR: apps already running? ['listrev-app-g']   
                    RuntimeError: ERROR: apps already running? ['listrev-app-g']                                                                                                              
           ERROR    json2 went to error!                                                                                                                                   statefulnode.py:225
                    Couldn't boot json2                                                                                                                                                       
                    An error occured while executing boot                                                                                                                                     
                    An exception was thrown: ERROR: apps already running? ['listrev-app-g']

eflumerf commented 1 year ago

More testing shows that the real issue is that, although nanorc reports that the application exited with a status code, the application is actually still running, which causes problems for the subsequent run.

eflumerf commented 1 year ago

As far as I can tell, this is a very low-level issue with nanorc, in that instead of managing the apps themselves, it is managing the ssh connections to the apps, and assuming that if the connection is gone then the app must be dead. The issue here is that at least on the AlmaLinux 9 machine I'm testing with, the daq_applications hang around for a fairly significant amount of time after their SSH connection has been terminated.

I don't know if it is worth the effort at this point to fix nanorc to properly manage the daq_application processes (by obtaining their remote PIDs and monitoring using that, for example) or if a replacement RC system will be coming soon enough with this kind of thing in mind.
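For illustration, here is a minimal sketch of what "monitoring by PID" could look like, as opposed to inferring death from the SSH connection going away. It is not nanorc code: `pid_alive` and `wait_until_gone` are hypothetical helper names, the check is POSIX-only, and in the real setup it would have to run on the remote host (how the remote PID is obtained is left open here).

```python
import os
import subprocess
import sys
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_until_gone(pid: int, timeout: float = 30.0, poll: float = 0.2) -> bool:
    """Poll until the PID disappears; return False if it is still there at timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(poll)
    return not pid_alive(pid)


if __name__ == "__main__":
    # Stand-in for a DAQ application: a child process that just sleeps.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("alive after start:", pid_alive(proc.pid))
    proc.terminate()
    proc.wait()  # reap the child so that it does not linger as a zombie
    print("gone after terminate:", wait_until_gone(proc.pid, timeout=5.0))
```

With a check like this, an application would only be declared dead once its own PID has disappeared, and a boot could be refused (or delayed) while the previous instance is still around.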

plasorak commented 1 year ago

Hi Eric, thanks for reporting this. Just so that I understand: it seems that the DAQ application doesn't disappear after the ssh process has been terminated.

Does "fairly significant amount of time" mean that the application gets killed in the end, or that you have to wait a couple of seconds before it stops?

I don't have access to an AlmaLinux 9 machine, unfortunately. Is one of the dunegpvm machines running AlmaLinux? I do think we should fix it.

plasorak commented 1 year ago

Could this also be related? https://github.com/DUNE-DAQ/nanorc/issues/164

eflumerf commented 1 year ago

So I do see nanorc reporting that it exited (in the logs I posted before, the listrev-app-g was conflicting with the still-running listrev-app-v that had been declared to be exited).

I see the applications exiting after 10-30 seconds. I don't think they're killed; rather, they exit by themselves after noticing that their input has gone away. It's long enough that I have time to Ctrl-C the integration test, run ps, and connect gdb.

plasorak commented 1 year ago

But the apps won't stop unless they are killed. There's no way for them to notice that their input has gone away; they work like a REST API, for better or for worse. I suspect this is some sort of automated feature that kills zombie processes.

eflumerf commented 1 year ago

I think I've eliminated AL9 as a culprit...running the tests on a VM does not show the same behavior. On daq.fnal.gov, it happens every time.

plasorak commented 1 year ago

Okay, so I've also experienced this problem. Every time I interrupt a run on a simple emulator DAQ setup, the readout application persists, even though it is declared dead and the SSH process is killed.

I'm writing here what I found (sorry if this is messy):

Removing the SIGHUP signal handler in appfwk changes the behaviour. This is interesting because nanorc is able to send kill or terminate (SIGKILL or SIGTERM, respectively) to the ssh process (not the DAQ app), but because we feed -tt to the ssh command, this always gets translated to SIGHUP at the application level...
The TLOG in the daq_application.cxx (line 44) signal handler is useless in the case where the application runs with nanorc (it never makes it to the logs) because the ssh process itself has disappeared at this stage, and python cannot process the log. You can use the plasorak/end-info branch in appfwk to get ERS messages in Grafana (after doing the correct opmon/ers_impl).
Another interesting thing is that the daq_application returns!! You can see it in the logs on Grafana from this line. Still, there are readout processes in htop, and the application won't boot a second time due to port clash.
Another oddity is that this behaviour only happens after enable_triggers. So if I quit nanorc after conf, no application persists. Similarly, it doesn't happen after boot.
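To make the SIGHUP point concrete, here is a small Python stand-in (not the appfwk C++ handler, whose details are not shown in this thread) for a process that treats SIGHUP as a shutdown request and exits cleanly. The `CHILD` script and `run_and_hangup` are hypothetical names used only for this sketch; the parent sends SIGHUP directly, whereas in the nanorc case the hangup would arrive when the `ssh -tt` session is torn down.

```python
import signal
import subprocess
import sys
import textwrap

# Child process: installs a SIGHUP handler that requests a clean shutdown.
CHILD = textwrap.dedent("""
    import signal, sys, time

    stop = False

    def on_hup(signum, frame):
        global stop
        stop = True  # only set a flag; the main loop does the actual exit

    signal.signal(signal.SIGHUP, on_hup)
    print("ready", flush=True)  # tell the parent that the handler is installed
    while not stop:
        time.sleep(0.05)
    sys.exit(0)
""")


def run_and_hangup() -> int:
    """Start the child, send it SIGHUP, and return its exit code."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    ) as proc:
        proc.stdout.readline()  # block until the child has printed "ready"
        proc.send_signal(signal.SIGHUP)
        return proc.wait(timeout=10)


if __name__ == "__main__":
    print("child exit code:", run_and_hangup())
```

The child leaves its main loop only because the handler sets the flag. A handler that logs and returns without stopping the worker threads would leave the process running, which is one possible reading of the symptoms described above, though the thread does not confirm the root cause.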

plasorak commented 1 year ago

Closing this as Eric has merged his appfwk Signal handling PR.

DUNE-DAQ / nanorc

Apps persisting after being declared dead #178