SCOREC / fep

Finite Element Programming course materials

Launch failed requeued held #2

Closed Durabun closed 5 years ago

Durabun commented 5 years ago

Hi, what exactly does the "launch failed requeued held" message mean? It shows up when I call squeue after running sbatch. What causes it, and what can I do about it? Thanks.

cwsmith commented 5 years ago

Hmm. I'm not familiar with that one offhand. Would you please paste the commands you are running (./run.sh <args>, squeue, etc.) and all of their output? A screenshot will also work.

Aerokaur commented 5 years ago

Hello, I am getting the same thing. My jobs completed before, but now they are stuck. I am attaching a screenshot. Thanks.

cwsmith commented 5 years ago

OK. I am able to reproduce the problem and will report it to CCI support.

FEP4smth@q: ~/fep/a1 (master)$ sbatch ./run.sh ./build/a1tet 
Submitted batch job 926309
FEP4smth@q: ~/fep/a1 (master)$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES MIDPLANELIST(REASON)
            926309     debug   fepsA1 FEP4smth  R       0:03      1 bgq0210[02200]
FEP4smth@q: ~/fep/a1 (master)$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES MIDPLANELIST(REASON)
            926309     debug   fepsA1 FEP4smth PD       0:00      1 (launch failed requeued held)
FEP4smth@q: ~/fep/a1 (master)$

Aerokaur commented 5 years ago

OK, thank you.

cwsmith commented 5 years ago

You're welcome. I submitted a support ticket and will update this issue when there is more info.

cwsmith commented 5 years ago

Someone in another project attempted (possibly repeatedly) to run a compute-node binary on a front-end node (amos, q, q2) instead of using sbatch/srun to run it on the compute nodes. This put enough load on the front-end node to disrupt the operation of the job scheduler.

Jobs with the status "launch failed requeued held" can be released with scontrol release <jobid>, as in the example below:

FEP4smth@q: ~/fep/a1 (master)$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES MIDPLANELIST(REASON)
            926309     debug   fepsA1 FEP4smth PD       0:00      1 (launch failed requeued held)
FEP4smth@q: ~/fep/a1 (master)$ scontrol release 926309
FEP4smth@q: ~/fep/a1 (master)$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES MIDPLANELIST(REASON)
            926309     debug   fepsA1 FEP4smth  R       0:02      1 bgq0210[23300]
FEP4smth@q: ~/fep/a1 (master)$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES MIDPLANELIST(REASON)
FEP4smth@q: ~/fep/a1 (master)$
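
If several jobs end up in this state, releasing them one at a time gets tedious. The following is a minimal sketch, not something from CCI support, that releases all of them in one pass. The function names held_job_ids and release_held_jobs are hypothetical. It assumes that squeue's %r format field reports the reason as exactly "launch failed requeued held" (without the parentheses shown in the default listing) and that $USER matches your account name on the system.

```shell
#!/usr/bin/env bash

# Read lines of the form "<jobid> <reason...>" on stdin and print the IDs of
# jobs whose reason is exactly "launch failed requeued held".
held_job_ids() {
  awk '{
    id = $1
    $1 = ""            # drop the job ID, leaving only the reason text
    sub(/^ +/, "")     # trim the leading separator left behind
    if ($0 == "launch failed requeued held") print id
  }'
}

# Release every pending job owned by the current user that is in that state.
# Assumption: squeue and scontrol are on PATH, as on the front-end nodes.
release_held_jobs() {
  squeue -h -u "$USER" -t PD -o "%i %r" | held_job_ids | while read -r id; do
    echo "releasing job $id"
    scontrol release "$id"
  done
}
```

After sourcing the file, run release_held_jobs and then squeue to confirm that the jobs have left the PD state. Jobs that are pending for other reasons (for example Priority or Resources) are left untouched because the reason must match exactly.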