canonical / testflinger

https://testflinger.readthedocs.io/en/latest/
GNU General Public License v3.0

Trigger an agent restart if USR1 signal is received #346

Closed plars closed 1 month ago

plars commented 1 month ago

So I'm proposing this one against main rather than the feature branch.

Description

We already have a “safe restart” mechanism for the testflinger agent. This works by waiting until the agent is not running a job, then checking for a marker file that signals the agent to restart. This way, individual agents can be safely updated, then told to restart whenever they are done with their current jobs, so that we avoid a hard-restart that interrupts a job in progress.

Since supervisord gives us a nice mechanism for sending signals to all processes, we should take advantage of that by changing testflinger-agent to also treat a signal as a trigger to perform a safe restart. It’s common to use HUP to signal that something should reload its configuration files, but I’m not sure that really fits here. That’s normally just a “reread” of the configs, but in this case we’re telling the agent to exit completely, so that it not only rereads the configs but also loads the new version of the code that may have been installed. So I would suggest using USR1 for the signal in this case.
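To make the idea concrete, here is a minimal sketch of how the two triggers could be combined. It is not the code in this PR: the names `restart_requested`, `handle_usr1`, `should_restart`, and `agent_loop` are hypothetical, and the real agent would presumably exit at the point where this loop returns so that supervisord can start it again with the new code.

```python
import os
import signal
import threading

# Hypothetical flag; the real agent may track this differently.
restart_requested = threading.Event()


def handle_usr1(signum, frame):
    """Only record the request; the main loop acts on it between jobs."""
    restart_requested.set()


def install_restart_handler():
    """Register the USR1 handler (must be called from the main thread)."""
    signal.signal(signal.SIGUSR1, handle_usr1)


def should_restart(marker_path):
    """Return True if either trigger has fired: the signal or the marker file."""
    return restart_requested.is_set() or os.path.exists(marker_path)


def agent_loop(run_next_job, marker_path):
    """Run jobs one at a time, checking for a restart request only while idle.

    Returns the number of jobs completed before the restart was honoured.
    """
    install_restart_handler()
    jobs_run = 0
    while not should_restart(marker_path):
        run_next_job()  # a job in progress is never interrupted
        jobs_run += 1
    return jobs_run
```

Because the handler only sets a flag, a USR1 that arrives in the middle of a job has no effect until that job finishes, which matches the existing marker-file behaviour.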

Resolved issues

CERTTF-379

Documentation

Added documentation in the explanation section covering this and the other methods for safely shutting down or restarting an agent.

Web service API changes

N/A

Tests

Tested locally and also added a new test that executes the real agent, triggers a restart with the signal, and ensures it takes all the right steps.