CellProfiler / BatchProfiler


Need to resubmit some cluster jobs #3

Closed dlogan closed 8 years ago

dlogan commented 8 years ago

(MOVED from CellProfiler/CellProfiler https://github.com/CellProfiler/CellProfiler/issues/1522) I've had to resubmit cluster jobs multiple times for Batch 17 (http://imagewebrhel6/batchprofiler/cgi-bin/ViewBatch.py?batch_id=17). Most jobs in this batch completed initially, but some were left in the SUBMITTED state overnight. After I killed and resubmitted them, some more finished, but not all. A few jobs still have not completed, though they are now in the RUNNING state after another round of kill and resubmit.

It may be that the cluster is busy, but at least a couple of times, killing and resubmitting got more jobs to move to RUNNING, so this seems like an issue worth looking into. Sorry, I can't think of any other info that would help debug this.

dlogan commented 8 years ago

@braymp wrote: I've noticed this myself. The 'short' queue has a 2-hour limit; jobs that exceed it are killed, but silently, with nothing in the error logs and no change to the status.
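
For illustration only, here is a minimal sketch of how jobs that were killed silently might be spotted from the outside: flag any job still marked SUBMITTED or RUNNING after more than the queue limit. This is not BatchProfiler code. The record layout (`job_id`, `status`, `submitted_at`) and the function name `find_stale_jobs` are hypothetical, and measuring from submission time rather than actual start time is a simplifying assumption.

```python
from datetime import datetime, timedelta

# Assumption: the 2-hour limit of the 'short' queue mentioned in this thread.
SHORT_QUEUE_LIMIT = timedelta(hours=2)


def find_stale_jobs(jobs, now, limit=SHORT_QUEUE_LIMIT):
    """Return the ids of jobs still marked SUBMITTED or RUNNING past the limit.

    `jobs` is a hypothetical list of dicts with the keys 'job_id',
    'status', and 'submitted_at' (a datetime). Elapsed time is measured
    from submission, which overestimates run time for jobs that waited
    in the queue before starting.
    """
    stale = []
    for job in jobs:
        still_open = job["status"] in ("SUBMITTED", "RUNNING")
        if still_open and now - job["submitted_at"] > limit:
            stale.append(job["job_id"])
    return stale
```

A job flagged this way is only a candidate for a manual check or a kill and resubmit; the status alone cannot tell whether it was killed or is simply still waiting in the queue.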

dlogan commented 8 years ago

Related to (same as?) #2

dlogan commented 8 years ago

@LeeKamentsky wrote: Filed with IT as INC0070582

LeeKamentsky commented 8 years ago

Having #2 fixed will tell you that the job was killed - this issue should deal with the problem of queue timeouts being per-job rather than per-task.

LeeKamentsky commented 8 years ago

It looks like this was caused by an IT issue. There was no queue timeout. The job failed to start because a drive was not properly mounted on one of the nodes. The good news is that it wasn't a queue timeout, so we are good with our strategy and can close this. Issue #2 will deal with reporting situations like this.