GoogleCloudPlatform / slurm-gcp


Add graceful VM shutdown #182

Open casassg opened 6 days ago

casassg commented 6 days ago

Currently, when a machine is deleted, the Slurm step is interrupted without warning. It would be great to send all Slurm steps on the machine a SIGINT so they can run cleanup code (copy state into GCS, for example).

Especially significant for Spot VMs.

I have not been able to find whether Slurm currently handles this well.

bliklabs commented 6 days ago

Someone will have to correct me if I'm wrong, but each compute node should be able to query the controller daemon via slurm.conf and squeue. So in theory:

squeue -w $(hostname) -h -o "%.18i" | xargs -I {} scancel --signal=INT {}

Coupled with:

https://cloud.google.com/compute/docs/instances/create-use-spot#handle-preemption

Could work
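
For instance (untested sketch, not from the docs), that one-liner could live in the instance's shutdown script so it fires when GCE delivers the preemption/termination notice:

#!/bin/bash
# Hypothetical GCE shutdown script: signal every job running on this node
# so it gets a chance to checkpoint before the VM goes away.
squeue -w "$(hostname)" -h -o "%.18i" | xargs -r -I {} scancel --signal=INT {}

attached to the instance with something like gcloud compute instances add-metadata INSTANCE_NAME --metadata-from-file shutdown-script=shutdown.sh.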

casassg commented 4 days ago

@bliklabs yeah that would def work. It does, however, need RequeueExit=130 set in slurm.conf, as otherwise the job won't be requeued and the signal loses its purpose.
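
Roughly this in slurm.conf (sketch; 130 = 128 + 2, the exit code of a batch script killed by SIGINT):

# Requeue batch jobs whose script exits with 130 (128 + SIGINT)
RequeueExit=130
# Jobs must also be requeueable (JobRequeue=1 or sbatch --requeue)
JobRequeue=1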

bliklabs commented 4 days ago

Good point. Also, the scancel brace might be an issue on line length; it's likely better to use {}\;. I think scancel takes multiple IDs delimited by ',' so additional string formatting may be required if you want to pass multiple job IDs to a single scancel.

Also, maybe something to consider is finding the respective step PIDs locally via the cgroup slice and: | xargs -I {} kill -2 {}; this would reduce requests to the controller (rough sketch at the end of this comment).

It might also be good to check the behavior of slurmd during this type of process; it could still be polling for work if it's not in drain. Curious what would happen if you masked slurmd and then sent SIGINT to slurmd.
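
Rough, untested sketch of the local-signal idea (the cgroup path layout is an assumption and will differ between cgroup v1/v2 and Slurm versions):

#!/bin/bash
# Signal every task PID Slurm tracks in its cgroup hierarchy, without
# going through the controller. The path is a guess for cgroup v2.
for procs in /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*/step_*/user/task_*/cgroup.procs; do
  [ -f "$procs" ] || continue
  while read -r pid; do
    kill -2 "$pid" 2>/dev/null || true   # SIGINT
  done < "$procs"
done

# Or, avoiding cgroup paths entirely: signal the direct children of each slurmstepd
for sd in $(pgrep slurmstepd); do
  ps -o pid= --ppid "$sd" | xargs -r kill -2 2>/dev/null || true
done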

casassg commented 2 days ago

So after some experimentation on our staging cluster, I think I have something which may work well for now:

#!/bin/bash

set -euxo pipefail

# Send SIGTERM to all jobs, both children and the batch script.
# We use SIGTERM as it's the same signal sent by Slurm when a job is preempted.
# https://slurm.schedmd.com/scancel.html
# https://slurm.schedmd.com/preempt.html
echo "Shutting down Slurm jobs on $(hostname), sending SIGUSR2/SIGTERM to all jobs..."
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGTERM --full {}
# We also send SIGUSR2 to make sure submitit jobs are handled well.
# https://github.com/facebookincubator/submitit/blob/07f21fa1234e34151874c00d80c345e215af4967/submitit/core/job_environment.py#L152
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGUSR2 --full {}

# Mark node for power down as soon as possible
echo "Marking node $(hostname) for power down to avoid slurm not seeing it..."
scontrol update nodename="$(hostname)" state=power_down reason="Node is shutting down/preempted"

# Wait here so slurmd can ideally shut down gracefully (jobs on the node exit + Slurm powers down the node)
# before the spot instance is stopped, as far as possible.
SLURMD_PID="$(pgrep -n slurmd)"
while kill -0 "$SLURMD_PID" 2>/dev/null; do
   sleep 1
done

This, together with the following fragment in the initialization script:

echo "Installing shutdown script..."
chmod +x /opt/local/slurm/shutdown_slurm.sh
# Based on Google's shutdown script service and https://github.com/GoogleCloudPlatform/slurm-gcp/issues/182
cat <<EOF > /lib/systemd/system/slurm-shutdown.service
[Unit]
Description=Slurm Shutdown Service
Wants=network-online.target rsyslog.service
After=network-online.target rsyslog.service

[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=true
# This service does nothing on start, and runs shutdown scripts on stop.
ExecStop=/opt/local/slurm/shutdown_slurm.sh
TimeoutStopSec=0
KillMode=process

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl --no-reload --now enable /lib/systemd/system/slurm-shutdown.service
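
To sanity-check the hook without actually preempting the VM, stopping the unit by hand should run ExecStop as well (note this really does signal any running jobs and mark the node for power down):

systemctl stop slurm-shutdown.service
journalctl -u slurm-shutdown.service -b
# start it again afterwards so the hook is armed for the real shutdown
systemctl start slurm-shutdown.service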

Doing this seems to be okay, as we mark the node for power down (update node status) and send the TERM signal.

I ended up using SIGTERM to match the default behaviour of Slurm when preempting a job due to priority (https://slurm.schedmd.com/preempt.html).

The slurm.conf part which is needed:

# REQUEUE AND PREEMPTION
# Allow requeuing up to 5 times.
JobRequeue=1
MaxBatchRequeue=5
# Requeue jobs which have been interrupted by a node preemption or job preemption (SIGTERM)
# - Node preemption handled by shutdown_slurm.sh.
# - Job preemption handled https://slurm.schedmd.com/preempt.html
RequeueExit=143 # 128 + 15 = 143
# Add this here for submitit to work as expected
PreemptParameters=send_user_signal
PreemptMode=REQUEUE
PreemptType=preempt/qos

I decided on SIGTERM over SIGINT as SIGINT denotes a user sending it interactively, whereas TERM is a bit more automation-related?

This can be used in an sbatch script like so:

#!/bin/bash
#SBATCH --requeue
#SBATCH --cpus-per-task 1
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
set -euxo pipefail

CHDIR="$(pwd)"
TMP_DIR=/tmp
JOB_ID="${SLURM_JOBID:-0}"

sig_handler()
{
  echo "Got SIGTERM, saving state"
  wait # wait for all children, this is important!
  mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR"
  # Exit code 143 is the conventional exit code for SIGTERM (128 + 15). Exiting with it ensures we get
  # requeued even though we handled the signal, since RequeueExit=143 tells Slurm to requeue such jobs.
  exit 143
}
# trap SIGTERM
trap 'sig_handler' SIGTERM

# create file if it doesn't exist
if [ ! -f "./times-$JOB_ID.txt" ]; then
  touch "times-$JOB_ID.txt"
fi

cd "$TMP_DIR"
cp "$CHDIR/times-$JOB_ID.txt" .
date >> "./times-$JOB_ID.txt"
srun --jobid "$SLURM_JOBID" bash -c 'sleep 300'

echo "All done!"
bliklabs commented 1 day ago

Great solution. The only suggestion I have is a nit: I'm thinking some type of explicit error handling instead of set -e is preferable. As it stands, solid solution. And I actually prefer the last, job-based solution; it simplifies the implementation and turns the error-handling logic into templatable config for each job. Taking the time and effort to develop DAG workflows with frail retry logic around a specific multifactor implementation of Slurm scheduling is rewarding.
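
On the error-handling nit, something in this direction in the shutdown script (sketch of per-command handling instead of a global set -e):

if ! squeue -w "$(hostname)" -h -o "%.18i" | xargs -r -I {} scancel --signal=SIGTERM --full {}; then
  echo "WARNING: failed to signal one or more jobs, continuing shutdown anyway" >&2
fi

if ! scontrol update nodename="$(hostname)" state=power_down reason="Node is shutting down/preempted"; then
  echo "WARNING: could not mark node for power down" >&2
fi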

casassg commented 1 day ago

Alright, so after some weird debugging and issues reproducing my existing working test, I found a slight issue which may be useful to document for future users (and maybe worth adding to the slurm-gcp repo?).

The issue was that the above service definition meant slurm-shutdown.service could be stopped after slurmd, in which case scancel can't be sent to the tasks. Separately, it seems slurmd shuts down without notifying the controller, which leads to the controller not knowing whether jobs need to be preempted until the node fully disappears (which slurm_sync.py detects when the hostname becomes unreachable).

So modifications needed:

/opt/local/slurm/shutdown_slurm.sh

#!/bin/bash

# Send SIGTERM to all jobs, both children and the batch script.
# We use SIGTERM as it's the same signal sent by Slurm when a job is preempted.
# https://slurm.schedmd.com/scancel.html
# https://slurm.schedmd.com/preempt.html
echo "Shutting down Slurm jobs on $(hostname), sending SIGUSR2/SIGTERM to all jobs..."
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGTERM --full {}
# We also send SIGUSR2 to make sure submitit jobs are handled well.
# https://github.com/facebookincubator/submitit/blob/07f21fa1234e34151874c00d80c345e215af4967/submitit/core/job_environment.py#L152
squeue -w "$(hostname)" -h -o "%.18i" | xargs -I {} scancel --signal=SIGUSR2 --full {}

# Mark node for power down as soon as possible. Note that it's okay to do this here as Slurm 
# will still allow jobs to finish, but will not schedule new jobs on this node.
echo "Marking node $(hostname) for power down to avoid slurm not seeing it..."
scontrol update nodename="$(hostname)" state=power_down reason="Node is shutting down/preempted"

# Wait here so the job steps can ideally finish gracefully (jobs on the node exit + Slurm powers down the node)
# before the spot instance is stopped, as far as possible.
echo "Waiting for slurmstepd to stop: $(pgrep 'slurmstepd')"
while pgrep "slurmstepd" > /dev/null; do
   sleep 1
done

And the matching fragment in the initialization script:

if [ -f /opt/local/slurm/shutdown_slurm.sh ]; then
    echo "Setting up shutdown service..."
    chmod +x /opt/local/slurm/shutdown_slurm.sh
    # Based on Google's shutdown script service and https://github.com/GoogleCloudPlatform/slurm-gcp/issues/182
    cat <<EOF > /lib/systemd/system/slurm-shutdown.service
[Unit]
Description=Slurm Shutdown Service
# we need to run before slurmd is stopped
After=slurmd.service network-online.target
Wants=slurmd.service network-online.target

[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=true
# This service does nothing on start, and runs shutdown scripts on stop.
ExecStop=/opt/local/slurm/shutdown_slurm.sh
TimeoutStopSec=0
KillMode=process

[Install]
WantedBy=multi-user.target
EOF
    # Force services to stop quickly
    mkdir -p /etc/systemd/system.conf.d
    cat <<EOF > /etc/systemd/system.conf.d/10-timeout-stop.conf
[Manager]
DefaultTimeoutStopSec=2s
EOF
    systemctl daemon-reload
    systemctl --now enable /lib/systemd/system/slurm-shutdown.service
    echo "Shutdown service set up."
fi

This adds a dependency so the service starts after (and therefore stops before) slurmd.service, ensuring our shutdown script can still talk to a running slurmd.

Also added a shorter default timeout, as the default (90s) is literally the whole shutdown period allowed by GCP. This may not be needed, but I added it while trying to debug why it didn't work before (I assumed some service was slowing down the rest of the stopping services). It's hard to debug since the logs don't show up in the GCP logging console, so I had to take a guess, and it seems to work now.

I also modified the test script:

#!/bin/bash
#SBATCH --requeue             # This will requeue the job if preempted.
#SBATCH --cpus-per-task 1     # Only run with one CPU
#SBATCH --ntasks-per-node=1   # 1 task only
#SBATCH --nodes=1             # 1 node only
#SBATCH --time=6:00           # timeout after 6 minutes

set -uxo pipefail

CHDIR="$(pwd)"
TMP_DIR=/tmp
JOB_ID="${SLURM_JOBID:-0}"

### PREEMPTION HANDLING ###
sig_handler()
{
  echo "Got SIGTERM, saving state"
  mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR"
  # Exit code 143 is the conventional exit code for SIGTERM (128 + 15).
  # With --requeue and RequeueExit=143, Slurm will requeue jobs that exit with 143.
  exit 143
}
# trap SIGTERM and call sig_handler
trap 'sig_handler' SIGTERM

### JOB SETUP ###
# SLURM_RESTART_COUNT is the number of times the job has been restarted.
# You can use it to change behaviour on restarts, for example reloading checkpoints instead of starting from scratch.
RESTART_COUNT="${SLURM_RESTART_COUNT:-0}"
echo "Running job $JOB_ID. Restart count: $RESTART_COUNT"
# create file if it doesn't exist
if [ ! -f "$CHDIR/times-$JOB_ID.txt" ]; then
  touch "$CHDIR/times-$JOB_ID.txt"
fi

cd "$TMP_DIR"
cp "$CHDIR/times-$JOB_ID.txt" .

### JOB LOGIC ###
date >> "./times-$JOB_ID.txt"
# Note we run srun in the background so that our signal handler can be called while it runs.
srun --jobid "$SLURM_JOBID" bash -c 'sleep 300' &

# Let's wait for signals or end of all background commands
wait

### JOB CLEANUP ###
# This runs if the job didn't get preempted.
echo "All done!"
mv "$TMP_DIR/times-$JOB_ID.txt" "$CHDIR"