TUM-DAML / seml

SEML: Slurm Experiment Management Library

Command chaining may cause unexpected behavior #87

Closed n-gao closed 2 years ago

n-gao commented 2 years ago

Chaining commands via seml <collection> [commands] may have unintended side effects, especially when canceling jobs. The subsequent commands in the chain may get executed before Slurm has finished canceling all jobs.

Potential solutions:

danielzuegner commented 2 years ago

For the second solution, I guess we'd need to know exactly which jobs were meant to be cancelled and then check their status. Here we create a list of Slurm IDs to be cancelled. We could run

squeue -u user -t RUNNING,PENDING,COMPLETING,CONFIGURING,RESIZING,SUSPENDED

to get the IDs of all of the user's jobs that have not yet completed or failed. The output looks something like this:

    JOBID  PARTITION    NAME   USER ST       TIME  NODES NODELIST(REASON) 
6718258_18   gpu_all jobname   user  R    5:04:07      1 gpu14 
6718258_19   gpu_all jobname   user  R    5:04:07      1 gpu14 
 6718258_0   gpu_all jobname   user  R    5:27:16      1 gpu20 
 6718258_1   gpu_all jobname   user  R    5:27:16      1 gpu20 
 6718258_2   gpu_all jobname   user  R    5:27:16      1 gpu15 
 6718258_3   gpu_all jobname   user  R    5:27:16      1 gpu08 
 6718258_4   gpu_all jobname   user  R    5:27:16      1 gpu08 
 6718258_5   gpu_all jobname   user  R    5:27:16      1 gpu08 
 6718258_9   gpu_all jobname   user  R    5:27:16      1 gpu09 
6718258_10   gpu_all jobname   user  R    5:27:16      1 gpu10 
6718258_11   gpu_all jobname   user  R    5:27:16      1 gpu10 

We could then extract the JOBID for each of these jobs and see if the intersection with the ones that should have been cancelled is empty. What do you think?
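
A minimal sketch of that intersection check, assuming the default squeue table layout shown above (one header line, JOBID in the first column). The function names `parse_squeue_job_ids` and `still_active` are hypothetical and not part of seml:

```python
def parse_squeue_job_ids(squeue_output):
    """Collect the JOBID column from default-format squeue output."""
    lines = [line for line in squeue_output.splitlines() if line.strip()]
    # Skip the header line, then take the first whitespace-separated field.
    return {line.split()[0] for line in lines[1:]}


def still_active(cancelled_ids, squeue_output):
    """Return the cancelled IDs that squeue still lists as not finished."""
    active = parse_squeue_job_ids(squeue_output)
    # Array tasks appear as "<job>_<task>". A cancelled ID without a task
    # suffix is treated as the whole array, so it matches any remaining task.
    active_base = {job_id.split("_")[0] for job_id in active}
    return {
        job_id
        for job_id in cancelled_ids
        if job_id in active or ("_" not in job_id and job_id in active_base)
    }


SAMPLE = """\
    JOBID  PARTITION    NAME   USER ST       TIME  NODES NODELIST(REASON)
6718258_18   gpu_all jobname   user  R    5:04:07      1 gpu14
 6718258_0   gpu_all jobname   user  R    5:27:16      1 gpu20
"""

# Only 6718258_18 is both in the cancel list and still listed by squeue.
print(sorted(still_active({"6718258_18", "6718258_7", "6700000"}, SAMPLE)))
# → ['6718258_18']
```

An empty result would mean none of the cancelled jobs are listed any more, so the next chained command could proceed.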

n-gao commented 2 years ago

I agree, this is what I had in mind to circumvent this issue. A way to make it even simpler is to use:

squeue -j <comma separated list of job_ids> -o %A

This way the output is a list of non-finished job ids (+1 line header).

JOBID
6720463
6720464
6720465
6720466
6720462
6719977
6719978
6719979
6719980
6719976
6718733
6718734
6718735
6718736
6718732
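
Building on this, one way to block the next chained command is to poll squeue until none of the cancelled IDs are listed any more. The following is only a sketch under a few assumptions: the helper names, the polling interval, and the timeout are hypothetical, and the IDs are passed with `-j`, which is squeue's filter for job IDs. A failing squeue call (for example because a single requested ID is no longer known) is treated here as "nothing left", which may or may not be the desired behavior.

```python
import subprocess
import time


def parse_job_ids(squeue_output):
    """Parse `squeue -o %A` output: a one-line header, then one job ID per line."""
    lines = [line.strip() for line in squeue_output.splitlines() if line.strip()]
    return set(lines[1:])


def query_unfinished(job_ids):
    """Ask squeue which of the given job IDs it still lists."""
    result = subprocess.run(
        ["squeue", "-j", ",".join(job_ids), "-o", "%A"],
        capture_output=True,
        text=True,
        check=False,  # assumption: an error exit is read as "no jobs listed"
    )
    return parse_job_ids(result.stdout)


def wait_until_gone(job_ids, query=query_unfinished, poll_seconds=1.0, timeout=60.0):
    """Poll until no requested job is listed; return False on timeout."""
    if not job_ids:
        return True
    deadline = time.monotonic() + timeout
    while True:
        remaining = query(job_ids) & set(job_ids)
        if not remaining:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

The `query` parameter exists so the loop can be exercised without a Slurm cluster, for instance by passing a function that returns a fixed set of IDs.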