CoBrALab / qbatch


PBS holds jobs which depend on already completed jobs? #119

Closed pipitone closed 8 years ago

pipitone commented 8 years ago

So here's something. When you run qbatch with --depend it matches all jobs returned by qstat, even if they are already completed. With whatever version of PBS we are running at CAMH (I think it might be an ancient version of TORQUE because qsub --version returns 3.0.4), jobs that depend on already completed jobs are held forever.

To reproduce:

$ echo "hello" | qbatch -
163592.mgmt2.scc.camh.net

# wait until 163592 has completed

$ echo "hello" | qbatch --depend 163592.mgmt2.scc.camh.net -
163593.mgmt2.scc.camh.net

$ qstat | grep 163593
163593.mgmt2               STDIN            jpipitone              0 H short1n   

$ qstat -f 163593 | grep depend
    depend = afterok:163592.mgmt2.scc.camh.net@mgmt2.scc.camh.net

andrewjanke commented 8 years ago

Not surprised. I once had the pleasure of dealing with a version of PBS on an HPC cluster that would throw in the towel if you specified a dependency on a job number that didn't exist! It made me distinctly unpopular with the uni cluster admins, since I could bring down the entire system at will, requiring a reboot...

Some of the malarkey here:

https://github.com/andrewjanke/volgenmodel/blob/master/volgenmodel#L752

is there to deal with the eventuality you are striking now...

(Sent by email in reply to Jon Pipitone's report above, 3 August 2016 at 21:32.)

gdevenyi commented 8 years ago

We should be able to adapt the qstat XML parsing to match only queued or running jobs.

Good catch.
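For illustration, here is a rough sketch of what that filtering could look like. It is not the actual qbatch code: the function name `matching_active_jobs`, the sample XML, and the choice of `Q` and `R` as the only "active" states are all assumptions, and the element names (`Job`, `Job_Id`, `Job_Name`, `job_state`) are assumed to follow TORQUE's `qstat -x` output.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like what TORQUE's `qstat -x` is assumed to print.
SAMPLE_XML = """<Data>
  <Job><Job_Id>163592.mgmt2.scc.camh.net</Job_Id><Job_Name>STDIN</Job_Name><job_state>C</job_state></Job>
  <Job><Job_Id>163593.mgmt2.scc.camh.net</Job_Id><Job_Name>STDIN</Job_Name><job_state>Q</job_state></Job>
  <Job><Job_Id>163594.mgmt2.scc.camh.net</Job_Id><Job_Name>other</Job_Name><job_state>R</job_state></Job>
</Data>"""

# Assumption: only queued (Q) and running (R) jobs are valid dependency targets.
ACTIVE_STATES = frozenset({"Q", "R"})


def matching_active_jobs(qstat_xml, pattern, active_states=ACTIVE_STATES):
    """Return IDs of jobs whose ID equals `pattern`, or whose name matches it
    as a glob, skipping any job that is not in one of `active_states`."""
    root = ET.fromstring(qstat_xml)
    matches = []
    for job in root.iter("Job"):
        job_id = job.findtext("Job_Id", default="")
        name = job.findtext("Job_Name", default="")
        state = job.findtext("job_state", default="")
        if state not in active_states:
            continue  # completed or otherwise inactive: never depend on it
        if job_id == pattern or fnmatch.fnmatchcase(name, pattern):
            matches.append(job_id)
    return matches


if __name__ == "__main__":
    print(matching_active_jobs(SAMPLE_XML, "STDIN"))
    print(matching_active_jobs(SAMPLE_XML, "163592.mgmt2.scc.camh.net"))
```

With the sample above, the completed job 163592 is skipped both when matched by name and when requested by its exact ID, so nothing ends up depending on it. Whether held (`H`) or waiting (`W`) jobs should also count as valid targets is a separate decision that this sketch does not try to settle.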

pipitone commented 8 years ago

Okay, in 8808c9a completed or errored jobs are ignored.
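For anyone reading later, the effect on the submitted job can be sketched roughly as follows. This is not the code from 8808c9a; `depend_args` is a hypothetical helper, and it assumes the dependency is passed to qsub as `-W depend=afterok:<id>[:<id>...]`, which matches the `depend = afterok:...` attribute shown in the report above.

```python
def depend_args(active_job_ids):
    """Build qsub arguments for an afterok dependency (hypothetical helper).

    `active_job_ids` is assumed to already exclude completed or errored jobs.
    An empty list yields no arguments at all, so the new job is submitted
    without a dependency instead of being held on a job that has finished.
    """
    if not active_job_ids:
        return []
    return ["-W", "depend=afterok:" + ":".join(active_job_ids)]


if __name__ == "__main__":
    # The dependency target has already completed, so nothing is passed to qsub.
    print(depend_args([]))
    # Two jobs still pending: both are listed in a single afterok clause.
    print(depend_args(["163593.mgmt2.scc.camh.net", "163594.mgmt2.scc.camh.net"]))
```

The point of returning an empty list is that a dependency on nothing is simply dropped, which avoids the permanently held job from the original reproduction.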