n-gao closed 2 years ago
For the second solution, I guess we'd need to know exactly which jobs were meant to be cancelled and then check their status. Here we create a list of Slurm IDs to be cancelled. We could run
squeue -u user -t RUNNING,PENDING,COMPLETING,CONFIGURING,RESIZING,SUSPENDED
to get the IDs of all of the user's jobs that have not yet completed or failed. This returns something like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6718258_18 gpu_all jobname user R 5:04:07 1 gpu14
6718258_19 gpu_all jobname user R 5:04:07 1 gpu14
6718258_0 gpu_all jobname user R 5:27:16 1 gpu20
6718258_1 gpu_all jobname user R 5:27:16 1 gpu20
6718258_2 gpu_all jobname user R 5:27:16 1 gpu15
6718258_3 gpu_all jobname user R 5:27:16 1 gpu08
6718258_4 gpu_all jobname user R 5:27:16 1 gpu08
6718258_5 gpu_all jobname user R 5:27:16 1 gpu08
6718258_9 gpu_all jobname user R 5:27:16 1 gpu09
6718258_10 gpu_all jobname user R 5:27:16 1 gpu10
6718258_11 gpu_all jobname user R 5:27:16 1 gpu10
We could then extract the JOBID
for each of these jobs and see if the intersection with the ones that should have been cancelled is empty. What do you think?
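A minimal sketch of that check, assuming the squeue output has already been captured as a string. The function names `parse_job_ids` and `still_active` are hypothetical and not part of seml. Array tasks show up as `<base>_<index>`, so the sketch also matches on the base ID, on the assumption that the list of cancelled IDs may contain either form.

```python
def parse_job_ids(squeue_output):
    """Return the JOBID column of default squeue output as a set of strings."""
    rows = [line.split() for line in squeue_output.strip().splitlines()]
    # The first row is the header (JOBID PARTITION NAME ...), so skip it.
    return {fields[0] for fields in rows[1:] if fields}


def still_active(cancelled_ids, squeue_output):
    """Return the IDs from cancelled_ids that squeue still lists."""
    active = parse_job_ids(squeue_output)
    # Array tasks are listed as "<base>_<index>"; keep the base IDs as well
    # so that a cancelled array job matches any of its remaining tasks.
    active_base = {job_id.split("_")[0] for job_id in active}
    return {j for j in cancelled_ids if j in active or j in active_base}
```

An empty result from `still_active` would mean that none of the jobs that should have been cancelled are still listed.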
I agree, this is what I had in mind to circumvent this issue. An even simpler way is to filter by job ID directly:
squeue -j <comma separated list of job_ids> -o %A
This way the output is a list of non-finished job ids (+1 line header).
JOBID
6720463
6720464
6720465
6720466
6720462
6719977
6719978
6719979
6719980
6719976
6718733
6718734
6718735
6718736
6718732
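The idea of waiting until none of the cancelled jobs are listed any more could be sketched as a small polling loop. This is only an illustration under stated assumptions: `query_active_ids` and `wait_until_cancelled` are hypothetical names, not seml functions, and the timeout and poll interval are arbitrary. The sketch filters by job ID with squeue's `-j` (`--jobs`) option and keeps only purely numeric output lines, which drops the `JOBID` header.

```python
import subprocess
import time


def query_active_ids(job_ids, run=subprocess.run):
    """Ask squeue which of the given job IDs are still listed."""
    result = run(
        ["squeue", "-j", ",".join(job_ids), "-o", "%A"],
        capture_output=True,
        text=True,
    )
    # Keep numeric lines only: this skips the header and any blank lines.
    # Assumption: error messages go to stderr and can be ignored here.
    return {
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip().isdigit()
    }


def wait_until_cancelled(job_ids, query=query_active_ids,
                         poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll until no given job is listed; return the IDs still remaining."""
    job_ids = [str(j) for j in job_ids]
    deadline = time.monotonic() + timeout
    remaining = query(job_ids)
    while remaining and time.monotonic() < deadline:
        sleep(poll_interval)
        # Only re-query the jobs that were still listed last time.
        remaining = query(sorted(remaining))
    return remaining
```

A caller could run this right after issuing the cancel and only proceed to the next chained command once the returned set is empty. A non-empty result means the timeout was hit first.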
Chaining commands via
seml <collection> [commands]
may have unintended side effects, especially when canceling jobs. It may happen that the following commands get executed before all jobs have been completely canceled by slurm.
Potential solutions:
seml <collection> cancel
after the jobs have actually been cancelled by slurm