Closed: pipitone closed this issue 8 years ago
Not surprised. I once had the pleasure of dealing with a version of PBS on an HPC cluster that would throw in the towel if you specified a dependency on a job number that didn't exist! It made me distinctly unpopular with the Uni cluster admins, since I could bring down the entire system at will, requiring a reboot...
Some of the malarkey here:
https://github.com/andrewjanke/volgenmodel/blob/master/volgenmodel#L752
is there to deal with the eventuality you are striking now...
On 3 August 2016 at 21:32, Jon Pipitone notifications@github.com wrote:
So here's something. When you run qbatch with --depend it matches all jobs returned by qstat, even if they are already completed. With whatever version of PBS we are running at CAMH (I think it might be an ancient version of TORQUE because qsub --version returns 3.0.4), jobs that depend on already completed jobs are held forever.
To reproduce:
$ echo "hello" | qbatch -
163592.mgmt2.scc.camh.net
(wait until 163592 has completed)
$ echo "hello" | qbatch --depend 163592.mgmt2.scc.camh.net -
163593.mgmt2.scc.camh.net
$ qstat | grep 163593
163593.mgmt2 STDIN jpipitone 0 H short1n
$ qstat -f 163593 | grep depend
depend = afterok:163592.mgmt2.scc.camh.net@mgmt2.scc.camh.net
We should be able to adapt the qstat XML parsing to match only queued or running jobs.
Good catch.
Okay, in 8808c9a completed or errored jobs are ignored.
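For anyone landing here later, here is a minimal sketch of the filtering idea discussed above. It is not the code from 8808c9a. It assumes that `qstat -x` on TORQUE emits XML with `<Job>` elements containing `<Job_Id>`, `<Job_Name>` and `<job_state>`, and it uses a hand-written sample instead of calling `qstat`. The function name `active_job_ids`, the `ACTIVE_STATES` set, and the sample XML are all hypothetical.

```python
import fnmatch
import xml.etree.ElementTree as ET

# States treated as "still active" in this sketch (an assumption):
# Q = queued, R = running, H = held, W = waiting.
# Anything else, e.g. C (completed) or E (exiting), is skipped.
ACTIVE_STATES = {"Q", "R", "H", "W"}


def active_job_ids(qstat_xml, pattern="*"):
    """Return ids of active jobs whose name or id matches the glob pattern."""
    root = ET.fromstring(qstat_xml)
    ids = []
    for job in root.iter("Job"):
        job_id = job.findtext("Job_Id", default="")
        name = job.findtext("Job_Name", default="")
        state = job.findtext("job_state", default="")
        if state not in ACTIVE_STATES:
            continue  # never depend on a job that has already finished
        if fnmatch.fnmatchcase(name, pattern) or fnmatch.fnmatchcase(job_id, pattern):
            ids.append(job_id)
    return ids


# Hand-written sample modelled on the jobs in this issue (assumed layout).
SAMPLE = """<Data>
<Job><Job_Id>163592.mgmt2.scc.camh.net</Job_Id><Job_Name>STDIN</Job_Name><job_state>C</job_state></Job>
<Job><Job_Id>163593.mgmt2.scc.camh.net</Job_Id><Job_Name>STDIN</Job_Name><job_state>H</job_state></Job>
</Data>"""

if __name__ == "__main__":
    print(active_job_ids(SAMPLE, "STDIN"))
```

With the sample above, the completed job 163592 is dropped and only 163593 is returned, so a dependency list built from the result would not point at a job that has already finished.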