Issue:
The hibinit-agent does not gracefully terminate on shutdown.
Because hibinit-agent.service sets DefaultDependencies=no, our process may be killed abruptly while a critical process (e.g. dracut) is running.
This can lead to various issues on resume; at worst, the root fs may fail to mount correctly.
Description of changes:
Hibinit-agent.service
Added a 2-minute timeout that applies when the service is requested to stop.
This should give our service ample time to complete any critical process in progress and exit gracefully.
Hibinit-agent
The SIGTERM handler now checks whether the global boolean CRITICAL_PROCESS_IN_PROGRESS is true.
If so, the agent does not exit until that process has completed; the pending shutdown is recorded in a second global boolean, SHUTDOWN_REQUESTED.
If no critical process is running, the agent terminates after attempting to remove the swap file and ensuring it is no longer active as swap.
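For reviewers, here is a minimal sketch of the deferred-shutdown pattern described above. It is illustrative only and is not the code in this change: the flag names come from this description, while `cleanup_swap`, `handle_sigterm`, and `run_critical` are hypothetical helpers, and running the same swap cleanup on the deferred path is an assumption.

```python
import signal
import sys

# Flag names taken from the PR description; everything else is hypothetical.
CRITICAL_PROCESS_IN_PROGRESS = False
SHUTDOWN_REQUESTED = False


def cleanup_swap():
    """Placeholder (assumption): deactivate and remove the swap file."""
    pass


def handle_sigterm(signum, frame):
    """Defer the exit while a critical process runs; otherwise clean up and exit."""
    global SHUTDOWN_REQUESTED
    if CRITICAL_PROCESS_IN_PROGRESS:
        # Only record the request; run_critical() exits once the work is done.
        SHUTDOWN_REQUESTED = True
        return
    cleanup_swap()
    sys.exit(0)


def run_critical(step):
    """Run `step` as a critical process, then honor any deferred SIGTERM."""
    global CRITICAL_PROCESS_IN_PROGRESS
    CRITICAL_PROCESS_IN_PROGRESS = True
    try:
        step()
    finally:
        CRITICAL_PROCESS_IN_PROGRESS = False
    if SHUTDOWN_REQUESTED:
        cleanup_swap()
        sys.exit(0)


# Install the handler (must be called from the main thread).
signal.signal(signal.SIGTERM, handle_sigterm)
```

In this sketch a SIGTERM that arrives inside `run_critical` only sets the flag, so the critical step always runs to completion before the process exits. The stop timeout on the service bounds how long that deferral can last.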
Both changes were manually tested on each supported OS.
In addition, each OS met our 99.9 percent success-rate threshold in batch end-to-end testing.
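| OS | Manually Tested? | TotalRuns | RunsSucceeded | RunsFailed | Percentage |
|---|---|---|---|---|---|
| AL2 | Yes | 10000 | 9999 | 1 | 99.99 |
| AL2023 | Yes | 1000 | 1000 | 0 | 100 |
| Ubuntu Focal | Yes | 1000 | 1000 | 0 | 100 |
| Ubuntu Jammy | Yes | 1000 | 1000 | 0 | 100 |
| RHEL8 | Yes | 1000 | 1000 | 0 | 100 |
| RHEL9 | Yes | 1000 | 1000 | 0 | 100 |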
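The percentages above can be rechecked against the 99.9 percent threshold with a few lines of Python. This is only a convenience sketch: the run counts are copied from the results above, and the helper names (`success_rate`, `meets_threshold`) are hypothetical and not part of the test tooling.

```python
# Batch end-to-end results from this PR: OS -> (total runs, succeeded runs).
RESULTS = {
    "AL2": (10000, 9999),
    "AL2023": (1000, 1000),
    "Ubuntu Focal": (1000, 1000),
    "Ubuntu Jammy": (1000, 1000),
    "RHEL8": (1000, 1000),
    "RHEL9": (1000, 1000),
}
THRESHOLD = 99.9  # required success rate, in percent


def success_rate(total, succeeded):
    """Return the success rate as a percentage."""
    return 100.0 * succeeded / total


def meets_threshold(total, succeeded, threshold=THRESHOLD):
    """Return True if the success rate is at or above the threshold."""
    return success_rate(total, succeeded) >= threshold


for os_name, (total, succeeded) in RESULTS.items():
    status = "pass" if meets_threshold(total, succeeded) else "fail"
    print(f"{os_name}: {success_rate(total, succeeded):.2f}% ({status})")
```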
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.