Closed: Durabun closed this issue 5 years ago
Hmm. I'm not familiar with that one offhand. Would you please paste the commands you are running (./run.sh <args>; squeue; etc.) and all of their output? A screenshot will also work.
Hello, I am also getting the same thing. Jobs completed before, but now they get stuck. I am attaching a screenshot. Thanks.
OK. I am able to reproduce the problem and will report it to CCI support.
FEP4smth@q: ~/fep/a1 (master)$ sbatch ./run.sh ./build/a1tet
Submitted batch job 926309
FEP4smth@q: ~/fep/a1 (master)$ squeue
JOBID PARTITION NAME USER ST TIME NODES MIDPLANELIST(REASON)
926309 debug fepsA1 FEP4smth R 0:03 1 bgq0210[02200]
FEP4smth@q: ~/fep/a1 (master)$ squeue
JOBID PARTITION NAME USER ST TIME NODES MIDPLANELIST(REASON)
926309 debug fepsA1 FEP4smth PD 0:00 1 (launch failed requeued held)
FEP4smth@q: ~/fep/a1 (master)$
Ok Thank you
You're welcome. I submitted a support ticket and will update this issue when there is more info.
Someone in another project (repeatedly?) attempted to run a compute node binary on the front-end node (amos, q, q2) instead of using sbatch/srun to run on the compute nodes. This put enough load on the node to disrupt the operation of the job scheduler.
Jobs with the status "launch failed requeued held" can be released with scontrol release <jobid>. See the example below:
FEP4smth@q: ~/fep/a1 (master)$ squeue
JOBID PARTITION NAME USER ST TIME NODES MIDPLANELIST(REASON)
926309 debug fepsA1 FEP4smth PD 0:00 1 (launch failed requeued held)
FEP4smth@q: ~/fep/a1 (master)$ scontrol release 926309
FEP4smth@q: ~/fep/a1 (master)$ squeue
JOBID PARTITION NAME USER ST TIME NODES MIDPLANELIST(REASON)
926309 debug fepsA1 FEP4smth R 0:02 1 bgq0210[23300]
FEP4smth@q: ~/fep/a1 (master)$ squeue
JOBID PARTITION NAME USER ST TIME NODES MIDPLANELIST(REASON)
FEP4smth@q: ~/fep/a1 (master)$
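If several jobs end up in this state at once, the release step above can be scripted. The following is only a rough sketch, assuming squeue prints the job ID in the first column and the reason in parentheses as in the listings above; held_job_ids and release_held_jobs are hypothetical helper names, not Slurm commands.

```shell
# Print the IDs of jobs whose squeue reason is "launch failed requeued held".
# Reads squeue output on stdin and assumes the job ID is the first column,
# as in the listings above.
held_job_ids() {
    awk '/\(launch failed requeued held\)/ { print $1 }'
}

# Release every such job belonging to the current user.
# Hypothetical helper: it wraps the same "scontrol release <jobid>" call
# shown in the example above.
release_held_jobs() {
    squeue -u "$USER" | held_job_ids | while read -r jobid; do
        scontrol release "$jobid"
    done
}
```

Running squeue | held_job_ids first lists the affected job IDs without changing anything, which is a reasonable check before calling release_held_jobs.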
Hi, I want to know what exactly the "launch failed requeued held" message means. It comes up when calling squeue after running sbatch. What is the reason for this message, and what can I do about it? Thanks.