It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

Job doesn't seem to be properly `canceled` #722

Closed: t-reents closed this issue 1 month ago

t-reents commented 2 months ago

I've observed that after calling `hq job cancel <id>`, the job doesn't seem to be properly canceled. For this example, I manually start a worker by submitting a Slurm batch script that requests the allocation:

```bash
#!/bin/bash
#SBATCH --no-requeue
#SBATCH --job-name="HQ-dev-g"
#SBATCH --get-user-env
#SBATCH --output=_scheduler-stdout.txt
#SBATCH --error=_scheduler-stderr.txt
#SBATCH --partition=dev-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --time=30
#SBATCH --mem=227328
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --hint=nomultithread

export OMP_PLACES=threads
export OMP_PROC_BIND=close
export MPICH_GPU_SUPPORT_ENABLED=1

# this exports $CPU_BIND and $OMP_NUM_THREADS given ntasks-per-node
# ...

/users/reentsti/bin/hq19 worker start --no-hyper-threading --manager slurm --heartbeat 10m --group test_group &

wait
```
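Once the allocation starts, the worker registers itself with the HQ server. As a quick sanity check (not part of the script above), it can be listed from the machine where the server runs:

```bash
# The new worker should appear here with group "test_group"
# once the Slurm job is running.
hq worker list
```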

This approach works per se, and HQ runs perfectly fine (as do my submissions within HQ). The actual jobs are submitted via the following script using `hq job submit`:

```bash
#!/bin/bash
#HQ --name="aiida-195142" --stdout=_scheduler-stdout.txt --stderr=_scheduler-stderr.txt --time-limit=1200s --cpus=56 --resource mem=200000

"srun" "-u" "-s" "--overlap" 'pwx-wrapper' '-npool' '8' '-in' 'aiida.in'  > "aiida.out"

(I need to use srun, similar to #443, and the problem occurs independently of the chosen srun options.) In general, running the jobs like this works well. However, when canceling the job via `hq job cancel`, the Slurm job steps are still listed as running:

```
sacct -j 7694006
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
7694006        HQ-dev-g      dev-g project_4+        112    RUNNING      0:0 
7694006.bat+      batch            project_4+        112    RUNNING      0:0 
7694006.0    pwx-wrapp+            project_4+        112    RUNNING      0:0
```

Moreover, when I resubmit the HQ job, I observe worse performance, as the allocation still seems to be busy. The following output is generated after resubmission:

```
sacct -j 7694006
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
7694006        HQ-dev-g      dev-g project_4+        112    RUNNING      0:0 
7694006.bat+      batch            project_4+        112    RUNNING      0:0 
7694006.0    pwx-wrapp+            project_4+        112    RUNNING      0:0 
7694006.1    pwx-wrapp+            project_4+        112    RUNNING      0:0
```

If I don't use the `--overlap` option in the srun command, I also receive the `srun: Job 7693701 step creation temporarily disabled, retrying (Requested nodes are busy)` warning (similar to #443), which more or less confirms that the previous steps are not fully canceled.
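As a stop-gap, a leftover step can also be killed by hand, since scancel accepts a `<jobid>.<stepid>` argument; using the IDs from the output above:

```bash
# Check which steps are still alive after the HQ cancel...
sacct -j 7694006 --format=JobID,State

# ...and cancel the stale step directly, leaving the allocation
# (and the HQ worker inside it) running.
scancel 7694006.0
```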

Any help would be much appreciated!

Kobzol commented 2 months ago

Hi, when you cancel a HQ job, the following happens:

- HyperQueue sends SIGINT to the task's process, asking it to shut down.
- If the task keeps running after a grace period, HyperQueue sends SIGKILL.

That's about all that HQ can do from user space. If the srun job step does not react to these two signals (and indeed, from my experience, Slurm job steps can be very... finicky around signals), then it is possible that the process will continue running even though HQ no longer treats it as running.

In general, I would suggest avoiding srun inside submitted HQ tasks, unless you specifically need it for multi-node MPI execution.
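If srun cannot be avoided, one possible mitigation is to forward the cancel signal to srun explicitly, so that Slurm gets a chance to tear the step down. A minimal, untested sketch, assuming HQ delivers SIGINT/SIGTERM to the task's shell as described above (note that srun treats a single SIGINT as a status request and only cancels the step on a second SIGINT received within one second):

```bash
#!/bin/bash
#HQ --name="aiida-195142" --time-limit=1200s --cpus=56

# Run srun in the background so the shell stays free to handle signals.
srun -u -s --overlap pwx-wrapper -npool 8 -in aiida.in > aiida.out &
child=$!

# On cancellation, forward SIGINT to srun twice: the first one only makes
# srun report task status, the second (within a second) cancels the step.
trap 'kill -INT "$child" 2>/dev/null; sleep 0.5; kill -INT "$child" 2>/dev/null' INT TERM

wait "$child"   # returns early if a trapped signal arrives
wait "$child"   # reap srun and pick up its real exit status
```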

t-reents commented 2 months ago

Hi @Kobzol! Thanks a lot for your quick reply!

I was already expecting something like that. Unfortunately, we need this multi-node use case. Nonetheless, thank you for the clarification.