CellProfiler / BatchProfiler


Offer to resubmit by requeueing #5

LeeKamentsky closed this issue 8 years ago

LeeKamentsky commented 8 years ago

It's possible to check for an error state that could have been caused by a cluster node failure and then requeue the tasks in that state. The command qstat -t -j <job-id> -xml returns the error reason inside an XML document, which may be useful when deciding whether the job can be requeued. The XML node /detailed_job_info/djob_info/element/JB_Ja_tasks/element/JAT_message_list/element/QIM_message holds the error reason, which in this case is "can't chdir to /imaging/analysis/CPCluster/CellProfiler-2.0/redhat_6/20150805143441_aa38cd92d244a87cb5b3b4410935775072319e2a: No such file or directory".
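A minimal sketch of reading that node from Python, assuming the XML root is detailed_job_info as the path above suggests. The function names are hypothetical, and the JAT_task_number element used to label each task is an assumption that does not appear in this thread; when it is missing the task number is simply None.

```python
import subprocess
import xml.etree.ElementTree as ET

# Paths taken from the issue text, relative to the <detailed_job_info> root.
TASK_PATH = "djob_info/element/JB_Ja_tasks/element"
MESSAGE_PATH = "JAT_message_list/element/QIM_message"


def error_messages(xml_text):
    """Return a list of (task_number, message) pairs found in qstat XML.

    task_number is None when the assumed JAT_task_number element is absent.
    """
    root = ET.fromstring(xml_text)
    results = []
    for task in root.findall(TASK_PATH):
        number = task.findtext("JAT_task_number")  # assumed element name
        for message in task.findall(MESSAGE_PATH):
            results.append((number, (message.text or "").strip()))
    return results


def fetch_job_xml(job_id):
    """Run the qstat command quoted in the issue and return its stdout."""
    return subprocess.check_output(
        ["qstat", "-t", "-j", str(job_id), "-xml"], universal_newlines=True)
```

On a submit host this could be used as error_messages(fetch_job_xml(job_id)). Parsing is kept separate from the qstat call so it can be exercised on saved XML.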

We should display job statuses on the ViewBatch page, with a requeue button that lets the user requeue jobs that appear to have errored out in a recoverable manner.

Jean Chang reports: If portions of the job fail due to "no such file or directory" or "changing into working directory" errors, those tasks in the array will go into error state. Rather than resubmitting the tasks, you can instead clear the error state from one of the UGER admin hosts (gold, silver or pt):

qmod -c

which will requeue all tasks that had been in error state.

Regards,

Jean
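A rough sketch of how the requeue button could be wired up, under some assumptions. The thread only shows "qmod -c" with no arguments; the sketch uses the Grid Engine -cj switch with a job_id[.task_id] target, which should be checked against the local UGER qmod man page. The function names and the list of hint strings are hypothetical; the hints are drawn from the error messages quoted above.

```python
import subprocess

# Substrings taken from the errors quoted in this thread.
RECOVERABLE_HINTS = (
    "no such file or directory",
    "can't chdir",
    "changing into working directory",
)


def looks_recoverable(message):
    """Heuristic: True if the error text matches one of the quoted hints."""
    lowered = message.lower()
    return any(hint in lowered for hint in RECOVERABLE_HINTS)


def clear_error_command(job_id, task_id=None):
    """Build a qmod command that clears the error state of a job or task.

    Assumption: -cj with a job_id[.task_id] argument; the thread only
    shows "qmod -c" without arguments.
    """
    target = str(job_id) if task_id is None else "%s.%s" % (job_id, task_id)
    return ["qmod", "-cj", target]


def requeue(job_id, task_id=None):
    """Run the command; must be executed on a host allowed to run qmod."""
    subprocess.check_call(clear_error_command(job_id, task_id))
```

The heuristic only decides whether to suggest a requeue; the final decision stays with the user, as proposed in the issue.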