DUNE-DAQ / drunc

Dune RUN Control (DRUNC) is the run control for the DUNE experiment

Die quicker, controller #265

Closed plasorak closed 1 month ago

plasorak commented 1 month ago

This PR corrects a typo in the data_type for the RunControlMessage. More importantly, it changes the behaviour if we can't reach the connectivity server on retract: if that's the case (meaning the connectivity server has probably been killed before the controller), we abort.
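As a rough illustration of the behaviour change described above (not the actual drunc code), the idea can be sketched as follows. The names `retract_or_abort`, `retract`, and `abort` are hypothetical, and the sketch assumes the failed retract surfaces as a Python `ConnectionError`:

```python
def retract_or_abort(retract, abort):
    """Try to retract from the connectivity server; abort if it is unreachable.

    `retract` and `abort` are hypothetical callables standing in for whatever
    the controller really uses. Returns True if the retract succeeded.
    """
    try:
        retract()
    except ConnectionError as exc:
        # Assumption: an unreachable server shows up as a ConnectionError.
        # The server was probably killed before the controller, so retrying
        # would only delay shutdown. Give up immediately instead.
        abort(f"connectivity server unreachable on retract: {exc}")
        return False
    return True
```

The point of the design is that the controller no longer waits on a server that is already gone, which is what shortens the time it lingers after shutdown.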

Fixes https://github.com/DUNE-DAQ/drunc/issues/204, again

bieryAtFnal commented 1 month ago

I'm not sure if I understand what I should see with these changes...

Without them, I see drunc-controller processes hang around for about 30 seconds after drunc exits when I run an interactive DAQ session on daq.fnal.gov.

With them, the drunc-controller processes hang around for less than 10 seconds, but they are still there when drunc exits.

Am I looking at the wrong thing?
If not, shouldn't success be indicated by no drunc-controller processes running when the drunc-interactive-shell exits?

plasorak commented 1 month ago

You are not looking at the wrong thing. On the np04 cluster, they still exist for around 2 seconds.

I don't think it's trivial to add a check to make sure there are no processes left when drunc exits. The process manager sends SIGHUP to the processes when it exits, but it does not track their PIDs.
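For reference, a minimal sketch of what such a check could look like if the child processes were tracked. This is not how the drunc process manager works today; `stop_tracked` is a hypothetical helper, and it assumes the children were started locally with `subprocess.Popen` on a POSIX system:

```python
import signal
import subprocess
import time


def stop_tracked(procs, grace_seconds=5.0):
    """Send SIGHUP to tracked children, then SIGKILL any that outlive the grace period.

    `procs` is a list of subprocess.Popen objects (an assumption: the real
    process manager may launch processes differently, e.g. over SSH).
    Returns the PIDs that had to be killed.
    """
    for proc in procs:
        if proc.poll() is None:  # still running
            proc.send_signal(signal.SIGHUP)

    # One shared deadline, so the total wait does not grow with the number of children.
    deadline = time.monotonic() + grace_seconds
    killed = []
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            killed.append(proc.pid)
    return killed
```

With something like this, "no drunc-controller processes running on exit" becomes checkable: after `stop_tracked` returns, every tracked process has been reaped.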

bieryAtFnal commented 1 month ago

Thanks for the update.

I think that it's very important to not leave processes hanging around when run control exits. Should I file a separate Issue for that?

plasorak commented 1 month ago

Yeah, I think so. This is quite a bit more complicated than what I envisaged.