Closed eflumerf closed 1 year ago
More testing shows that the real issue is that despite Nanorc reporting the application exited with a status code, it is actually still running, causing the subsequent run to have issues.
As far as I can tell, this is a very low-level issue with nanorc, in that instead of managing the apps themselves, it is managing the ssh connections to the apps, and assuming that if the connection is gone then the app must be dead. The issue here is that at least on the AlmaLinux 9 machine I'm testing with, the daq_application processes hang around for a fairly significant amount of time after their SSH connection has been terminated.
I don't know if it is worth the effort at this point to fix nanorc to properly manage the daq_application processes (by obtaining their remote PIDs and monitoring using that, for example) or if a replacement RC system will be coming soon enough with this kind of thing in mind.
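For illustration, the remote-PID idea mentioned above could look roughly like the sketch below. This is not how nanorc works today; the helper names (`remote_shell_line`, `launch_command`, `liveness_command`, `start`, `pid_alive`) are hypothetical, and it assumes the remote login shell is POSIX-like so that `echo $$; exec ...` reports the PID the application will keep after `exec`.

```python
import os
import shlex
import subprocess


def remote_shell_line(app_argv):
    """Shell line that prints the shell's PID, then replaces the shell with the app.

    Because of `exec`, the application keeps the PID that was printed.
    """
    return "echo $$; exec " + shlex.join(app_argv)


def launch_command(host, app_argv):
    """ssh argv (hypothetical helper) that starts the app and reports its remote PID."""
    return ["ssh", host, remote_shell_line(app_argv)]


def liveness_command(host, pid):
    """ssh argv whose exit status is 0 only while `pid` still exists on `host`."""
    return ["ssh", host, f"kill -0 {int(pid)}"]


def start(argv):
    """Start a launcher command; return (process, PID read from its first stdout line)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    return proc, int(proc.stdout.readline())


def pid_alive(pid):
    """Local equivalent of `kill -0`: True while a process with this PID exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

With something like this, the run control would declare an application dead only once the liveness check fails, rather than when the ssh connection goes away. PID reuse on the remote host is not handled here.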
Hi Eric, thanks for reporting this. Just so that I understand, it seems that the DAQ application doesn't disappear after the ssh process has been terminated.
Does "fairly significant amount of time" mean that the application gets killed in the end, or that you have to wait a couple of seconds before it stops?
I don't have access to an AlmaLinux 9 machine, unfortunately. Is one of the dunegpvm an AlmaLinux machine? I do think we should fix it.
Could this also be related? https://github.com/DUNE-DAQ/nanorc/issues/164
So I do see nanorc reporting that it exited (in the logs I posted before, the listrev-app-g was conflicting with the still-running listrev-app-v that had been declared to be exited).
I see the applications exiting after 10-30 seconds. I don't think they're killed; rather, they seem to exit by themselves after noticing that their input has gone away. It's long enough that I have time to Ctrl-C the integration test, run ps, and connect gdb.
But the apps won't stop unless they are killed. There's no way for them to notice that input goes away; they work like a REST API, for better or for worse. I suspect this is some sort of automated feature that kills zombie processes.
I think I've eliminated AL9 as a culprit...running the tests on a VM does not show the same behavior. On daq.fnal.gov, it happens every time.
Okay, so I've also experienced this problem. Every time I interrupt a run on a simple emulator DAQ setup, the readout application persists, even though it is declared dead and the SSH process is killed.
I'm writing here what I found (sorry if this is messy):
SIGHUP signal handler in appfwk changes the behaviour. This is interesting because nanorc is able to send kill or terminate (SIGKILL or SIGTERM, respectively) to the ssh process (not the DAQ app), but because we feed -tt to the ssh command, this always gets translated to SIGHUP at the application level...
TLOG in the daq_application.cxx (line 44) signal handler is useless in the case where the application runs with nanorc (it never makes it to the logs) because the ssh process itself has disappeared at this stage, and python cannot process the log. You can use the plasorak/end-info branch in appfwk to get ERS messages in Grafana (after doing the correct opmon/ers_impl).
enable_triggers. So if I quit nanorc after conf, no application persists. Similarly, it doesn't happen after boot.
Closing this as Eric has merged his appfwk Signal handling PR.
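To illustrate why a SIGHUP handler matters for the behaviour discussed in this thread, here is a small self-contained sketch. It is not the appfwk handler (which is C++); the child script and the `run_and_hang_up` helper are made up for the example. The child installs a SIGHUP handler that requests a clean stop, and the parent then delivers SIGHUP, as happens when the controlling terminal of an `ssh -tt` session goes away.

```python
import signal
import subprocess
import sys

# Stand-in for an application: it loops until SIGHUP asks it to stop.
CHILD_SOURCE = r"""
import signal, sys, time

stop_requested = False

def on_hangup(signum, frame):
    # Only set a flag here; the main loop does the actual shutdown.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGHUP, on_hangup)
print("ready", flush=True)   # tell the parent the handler is installed
while not stop_requested:
    time.sleep(0.05)
sys.exit(0)
"""


def run_and_hang_up():
    """Start the child, send it SIGHUP, and return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()           # wait until the handler is in place
    child.send_signal(signal.SIGHUP)  # what the app sees when the terminal hangs up
    exit_code = child.wait(timeout=10)
    child.stdout.close()
    return exit_code


if __name__ == "__main__":
    print("child exit code:", run_and_hang_up())
```

A handler that instead ignores SIGHUP, or that blocks on a slow shutdown, would leave the process alive after the ssh connection has gone, which is consistent with the symptom reported here.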
I've been trying to run the integration tests on our snazzy new group machine at Fermilab, and I've noticed failures which appear to be due to the app starting before NanoRC thinks that it should be...
...