aws / amazon-ec2-hibinit-agent

Apache License 2.0

Critical termination protection #36

Closed kjeffsh closed 10 months ago

kjeffsh commented 10 months ago

Issue:

The hibinit-agent does not terminate gracefully on shutdown. Because hibinit-agent.service sets DefaultDependencies=no, the process may be killed abruptly in the middle of a critical operation (e.g. dracut).

This can lead to various issues on resume; at worst, the root filesystem may fail to mount correctly.

Description of changes:

Hibinit-agent.service

Added a 2-minute timeout that applies when the service is asked to stop. This should give the service ample time to complete any critical process in progress and exit gracefully.
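
For illustration, a stop timeout like this is usually expressed with systemd's `TimeoutStopSec=` directive. The fragment below is a sketch under that assumption; the exact directive and value used in this PR's unit file are not quoted here.

```ini
[Service]
# Assumption: give the agent up to 2 minutes to stop before systemd
# escalates to SIGKILL. The real unit file may express this differently.
TimeoutStopSec=120
```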

Hibinit-agent

The SIGTERM handler now checks whether the global boolean CRITICAL_PROCESS_IN_PROGRESS is true. If so, the agent does not exit until that process has completed; the pending request is recorded in a second global boolean, SHUTDOWN_REQUESTED.

If no critical process is running, the agent attempts to remove the swap file, ensuring it is no longer in use as swap, and then terminates.
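
The deferral logic described above can be sketched roughly as follows. Only the two flag names come from this PR; `handle_sigterm`, `run_critical`, `cleanup_and_exit`, and `install_handler` are hypothetical names chosen for illustration, and the swap cleanup is left as a placeholder rather than a copy of the agent's real code.

```python
import signal
import sys

# Flag names taken from the PR description; everything else is illustrative.
CRITICAL_PROCESS_IN_PROGRESS = False
SHUTDOWN_REQUESTED = False


def cleanup_and_exit():
    """Hypothetical cleanup hook, then exit.

    The real agent would make sure the swap file is not active
    and remove it here before exiting.
    """
    sys.exit(0)


def handle_sigterm(signum, frame):
    """Defer termination while a critical process is running."""
    global SHUTDOWN_REQUESTED
    if CRITICAL_PROCESS_IN_PROGRESS:
        # Record the request and return; the critical section finishes first.
        SHUTDOWN_REQUESTED = True
        return
    cleanup_and_exit()


def run_critical(step):
    """Run step() with termination deferred until it returns."""
    global CRITICAL_PROCESS_IN_PROGRESS
    CRITICAL_PROCESS_IN_PROGRESS = True
    try:
        step()
    finally:
        CRITICAL_PROCESS_IN_PROGRESS = False
    # Honor a SIGTERM that arrived while the step was running.
    if SHUTDOWN_REQUESTED:
        cleanup_and_exit()


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

In this sketch the agent would call `install_handler()` once at startup and wrap steps such as the dracut invocation in `run_critical(...)`. The stop timeout in the unit file then bounds how long systemd waits for the deferred exit.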

Both changes were manually tested on each supported OS. In addition, each OS met our 99.9 percent success-rate threshold in batch end-to-end testing.

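| OS | Manually tested? | Total runs | Runs succeeded | Runs failed | Success rate (%) |
|---|---|---|---|---|---|
| AL2 | Yes | 10000 | 9999 | 1 | 99.99 |
| AL2023 | Yes | 1000 | 1000 | 0 | 100 |
| Ubuntu Focal | Yes | 1000 | 1000 | 0 | 100 |
| Ubuntu Jammy | Yes | 1000 | 1000 | 0 | 100 |
| RHEL8 | Yes | 1000 | 1000 | 0 | 100 |
| RHEL9 | Yes | 1000 | 1000 | 0 | 100 |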
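
As a quick sanity check, the percentages can be recomputed from the run counts in the table and compared against the 99.9 percent threshold. The helper name `success_rate` and the dictionary layout are illustrative only; the numbers are copied from the table.

```python
THRESHOLD = 99.9  # percent, as stated above

# (total runs, runs succeeded) per OS, copied from the results table.
RUNS = {
    "AL2": (10000, 9999),
    "AL2023": (1000, 1000),
    "Ubuntu Focal": (1000, 1000),
    "Ubuntu Jammy": (1000, 1000),
    "RHEL8": (1000, 1000),
    "RHEL9": (1000, 1000),
}


def success_rate(total, succeeded):
    """Return the success rate as a percentage."""
    return 100.0 * succeeded / total


for os_name, (total, succeeded) in RUNS.items():
    rate = success_rate(total, succeeded)
    status = "pass" if rate >= THRESHOLD else "fail"
    print(f"{os_name}: {rate:.2f}% ({status})")
```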

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.