So I'm proposing this one against main rather than the feature branch.
Description
We already have a “safe restart” mechanism for the testflinger agent. This works by waiting until the agent is not running a job, then checking for a marker file that signals the agent to restart. This way, individual agents can be safely updated, then told to restart whenever they are done with their current jobs, so that we avoid a hard-restart that interrupts a job in progress.
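For readers unfamiliar with the existing mechanism, here is a minimal sketch of the marker-file idea. The function names and the choice to delete the marker once it has been seen are assumptions made for illustration; this is not the agent's actual code.

```python
from pathlib import Path


def restart_requested(marker: Path) -> bool:
    """Return True if the marker file exists, removing it in the process.

    Removing the marker is an assumption of this sketch, so that the
    restarted process does not immediately exit again.
    """
    try:
        marker.unlink()
    except FileNotFoundError:
        return False
    return True


def run_jobs(jobs, marker: Path) -> list:
    """Run jobs one at a time, checking for the marker only between jobs."""
    completed = []
    for job in jobs:
        if restart_requested(marker):
            break  # the agent is idle here, so exiting interrupts nothing
        completed.append(job())
    return completed
```

The point is that the check happens only while the agent is idle, so a job already in progress always runs to completion before the agent exits.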
Since supervisord gives us a nice mechanism for sending signals to all processes, we should take advantage of that by changing testflinger-agent to also treat a signal as a trigger for a safe restart. It’s common to use HUP to signal that something should reload its configuration files, but I’m not sure that really fits here. HUP normally means just a “reread” of the configs, but in this case we’re telling the agent to exit completely, so that it not only rereads the configs but also loads any new version of the code that was installed. So I would suggest using USR1 for the signal in this case.
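As a rough sketch of the signal-based trigger (not the code in this PR), the handler below only records that a restart was requested, and the job loop acts on it once the agent is idle. `RestartFlag` and `run_until_restart` are hypothetical names chosen for this example.

```python
import signal


class RestartFlag:
    """Records that a safe restart was requested via a signal."""

    def __init__(self):
        self.pending = False

    def handle(self, signum, frame):
        # Only set a flag here; exiting from the handler could
        # interrupt a job that is still running.
        self.pending = True


def run_until_restart(jobs, flag: RestartFlag) -> list:
    """Run jobs one at a time, stopping between jobs once a restart is pending."""
    completed = []
    for job in jobs:
        if flag.pending:
            break  # idle point: safe to exit and let the supervisor restart us
        completed.append(job())
    return completed


flag = RestartFlag()
# Must be called from the main thread of the process.
signal.signal(signal.SIGUSR1, flag.handle)
```

With a handler like this in place, something along the lines of `supervisorctl signal USR1 all` should ask every agent to restart once its current job has finished.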
Resolved issues
CERTTF-379
Documentation
Added documentation to the explanation section covering this and the other methods for safely shutting down or restarting an agent.
Web service API changes
N/A
Tests
Tested locally, and also added a new test that executes the real agent, triggers a restart with the signal, and ensures that it takes all the right steps.