There are edge cases in which a container (say, serratus-align) is not doing any meaningful work, as measured by CPU%, and the shutdown procedures fail to catch this and gracefully close the instance and container. We rely on ec2-terminate for this graceful shutdown, but having a redundancy of sudo shutdown -h now or an equivalent function would be really nice.
One way to implement this is to add "health checks" for the instances: if CPU usage is, say, <5% for a sustained 5-10 minutes, the instance is terminated from outside. There are quite a few cases with serratus-align, serratus-dl and serratus-merge in which a few stragglers are left 'spooling' after scale-in or in the background during a run. In theory this would be a catch-all for several errors and would reduce inefficiencies.
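A minimal sketch of the health-check decision described above, not an existing part of serratus. The function names is_idle and reap_if_idle are hypothetical, and the sketch assumes the CPU samples (one percentage per polling interval) have already been collected by some outside monitor. The terminate call is the same one worker.sh already uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if EVERY CPU sample is
# below the threshold. Usage: is_idle THRESHOLD SAMPLE [SAMPLE...]
is_idle() {
  local threshold="$1"; shift
  # No samples means no evidence of idleness, so do not report idle.
  [ "$#" -gt 0 ] || return 1
  local sample
  for sample in "$@"; do
    # awk does the floating-point comparison that [ ] cannot.
    awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s < t) }' || return 1
  done
  return 0
}

# Hypothetical helper: terminate the instance from outside if it has
# been idle for the whole sampling window.
# Usage: reap_if_idle INSTANCE_ID THRESHOLD SAMPLE [SAMPLE...]
reap_if_idle() {
  local instance_id="$1" threshold="$2"; shift 2
  if is_idle "$threshold" "$@"; then
    aws ec2 terminate-instances \
      --region us-east-1 \
      --instance-ids "$instance_id"
  fi
}
```

With one sample per minute, passing ten samples and a threshold of 5 corresponds to the "<5% for 10 minutes" rule. Requiring every sample to be below the threshold (rather than the average) means a single burst of real work resets the decision.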
From serratus/containers/worker.sh:
shutdown)
  (
    flock 200
    echo " Shutting down instance"
    # TODO: change to shutdown (see below)
    aws ec2 terminate-instances \
      --region us-east-1 \
      --instance-ids "$INSTANCE_ID"
    sleep 300
    # TODO: Add a redundancy for shutdown
    #       to work from inside the container
    #
    # Secondary back-up -- shutdown instance
    # (set to "stopped" state if terminate fails)
    # yum -y install sudo shadow-utils util-linux
    # sudo shutdown -h now
    # sleep 300
    false
    exit 0
  ) 200> "$BASEDIR/.shutdown-lock"
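A possible shape for the redundancy mentioned in the TODO, offered only as a sketch. The function name terminate_then_halt and its grace-period argument are hypothetical. It assumes sudo and shutdown are installed in the container (the commented-out yum line suggests they are not by default) and that a shutdown issued from inside the container actually reaches the host, which depends on how the container is run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: ask EC2 to terminate the instance, wait for a
# grace period, and halt the OS if the instance is still running.
# Usage: terminate_then_halt INSTANCE_ID [GRACE_SECONDS]
terminate_then_halt() {
  local instance_id="$1" grace="${2:-300}"

  aws ec2 terminate-instances \
    --region us-east-1 \
    --instance-ids "$instance_id" \
    || echo "terminate-instances failed" >&2

  # If termination worked, the instance disappears during this sleep.
  sleep "$grace"

  # Still alive: fall back to halting the OS.
  sudo shutdown -h now
}
```

Inside the shutdown) branch this would be called as terminate_then_halt "$INSTANCE_ID", keeping the existing 300-second wait as the default grace period.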