CellProfiler / BatchProfiler

Job shows Complete even though it failed #19

Open dlogan opened 8 years ago

dlogan commented 8 years ago

In http://imagewebrhel6/batchprofiler/cgi-bin/ViewBatch.py?batch_id=117, run.213.19.txt initially failed because a file didn't exist. (I had tried to fix a badly named file, but the filelist wasn't updated.) In any case, the error said a file did not exist, yet the Status says "Complete". I only knew to look because it finished in 24 sec, i.e. too soon.

Subsequently, I tried to get ViewBatch to show Resubmit by clicking the Delete All button. The txt and err files were deleted, but there is no Resubmit button, i.e. the Status still says Complete, and I don't know how to resubmit this individual job, run.213.19.

dlogan commented 8 years ago

Related to #12?

dlogan commented 8 years ago

In fact, there are other jobs that failed and did not produce rows in the Per_Image table, yet show as "Complete".

These have Memory Errors:

http://imagewebrhel6/batchprofiler/cgi-bin/ViewTextFile.py?batch_array_id=213&task_id=21&file_type=text
http://imagewebrhel6/batchprofiler/cgi-bin/ViewTextFile.py?batch_array_id=213&task_id=22&file_type=text

LeeKamentsky commented 8 years ago

The current master branch (and the way it was on 9/10/15 at your checkout) exits with a status code of 0 even if there's an exception. Code that we're planning to check in has this facility in it.
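
As a rough illustration of the exit-status point above, here is a minimal Python sketch of a job wrapper that returns a non-zero code when the job raises. It is not BatchProfiler's actual code: `run_job` and `failing_job` are hypothetical names made up for this example.

```python
import traceback


def run_job(job):
    """Run a callable and return a process-style exit code.

    `job` is a hypothetical stand-in for whatever a batch task executes.
    """
    try:
        job()
    except Exception:
        # Print the traceback to stderr so the error is still recorded.
        traceback.print_exc()
        return 1  # non-zero signals failure to whatever launched the task
    return 0


def failing_job():
    # Simulates the missing-file error described earlier in this thread.
    raise FileNotFoundError("input file does not exist")


exit_code = run_job(failing_job)
# A real task script would finish with sys.exit(exit_code).
```

With a wrapper like this, a scheduler or status page can tell a crashed task from a finished one by checking the exit code instead of assuming success.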

I don't use the "done" file and maybe that's a mistake. I'd like to put working on BatchProfiler on the back burner for a couple of weeks, though, and maybe revisit this afterwards.

dlogan commented 8 years ago

But how can I resubmit the jobs that have failed? It seems impossible from any ViewBatch page since all batches (mis)report Complete. Can I submit via sudo as imageweb in any way? I looked at the job_scripts but I can't see how to do this.

LeeKamentsky commented 8 years ago

Sorry David. For this case, how about I mark the ones that failed as failed so that you can resubmit? I am guessing that the memory error is a problem that will reoccur. Is it possible that a large number of cells or particles are being segmented? The code is blowing up in a place that suggests that.

dlogan commented 8 years ago

Sure, please mark them as failed (how can I do that myself?). Just now I raised the memory_limit in the batchprofiler_2/batch database because yes, there are lots of synapse objects per image -- will that still work as in the old db scheme for resubmitting?

LeeKamentsky commented 8 years ago

Raising the memory limit should work. To reset the status, you can do something like this: look at the text file names, which are in the form run... Then run the following select statement to get the task_status_ids to delete:

select task_status_id from run_job_status where batch_array_id = 213 and task_id in (22, 23)

Then copy the task status IDs and do

delete from task_status where task_status_id in (203818, 203825)

I just did this for 203818 to see if it worked and it did. You can do it for the other one. I'm running a script now to see if any other tasks suffered from the same problem though, so perhaps you should hold off to see if I found more.
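
The select-then-delete procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than the real MySQL one, the table layouts are guessed from the column names in the two statements, the rows and IDs (7, 101 to 103) are toy data, and `reset_task_status` is a hypothetical helper, not part of BatchProfiler.

```python
import sqlite3

# Toy stand-in for the real database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_status (task_status_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE run_job_status (
        task_status_id INTEGER, batch_array_id INTEGER, task_id INTEGER);
    INSERT INTO task_status VALUES (101, 'Done'), (102, 'Done'), (103, 'Done');
    INSERT INTO run_job_status VALUES (101, 7, 1), (102, 7, 2), (103, 7, 3);
""")


def reset_task_status(conn, batch_array_id, task_ids):
    """Delete the status rows for the given tasks; return the deleted IDs."""
    if not task_ids:
        return []
    marks = ",".join("?" * len(task_ids))
    # Step 1: look up the task_status_ids for the chosen tasks.
    rows = conn.execute(
        "SELECT task_status_id FROM run_job_status "
        f"WHERE batch_array_id = ? AND task_id IN ({marks})",
        [batch_array_id, *task_ids],
    ).fetchall()
    status_ids = [row[0] for row in rows]
    if status_ids:
        marks = ",".join("?" * len(status_ids))
        # Step 2: delete exactly those rows, then commit the change.
        conn.execute(
            f"DELETE FROM task_status WHERE task_status_id IN ({marks})",
            status_ids,
        )
        conn.commit()
    return status_ids


deleted = reset_task_status(conn, 7, [2, 3])
remaining = conn.execute("SELECT task_status_id FROM task_status").fetchall()
```

Doing the lookup and the delete in one helper avoids copying IDs by hand, which is where a wrong task number could slip in.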

LeeKamentsky commented 8 years ago

There were only the two... you can delete 203825 if you want to resubmit.

dlogan commented 8 years ago

Cool - I think I get it now. I will delete 203825 and resubmit, thanks!

dlogan commented 8 years ago

Wait - it looks to me like it is run.213.21.txt that is not done. (Also 213.19, for a different reason.) Does that make sense to you, rather than 213.23?

LeeKamentsky commented 8 years ago

I changed run.213.21.txt's status as a test but left run.213.23.txt as "Done" so you could try out the delete. I don't know what you're running for MySQL, but it may be that you have to commit the transaction? (Try typing "commit".)
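
To show why a commit might be needed, here is a small Python sketch in which a second connection does not see a delete until the first one commits. It uses a temporary SQLite file as a stand-in; whether the same applies to the MySQL session in question depends on its autocommit setting, which this thread does not state. The table and the ID 101 are toy data.

```python
import os
import shutil
import sqlite3
import tempfile

tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "demo.db")

# "writer" plays the session that runs the delete.
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE task_status (task_status_id INTEGER PRIMARY KEY)")
writer.execute("INSERT INTO task_status VALUES (101)")
writer.commit()

# "reader" plays any other session, such as a status page.
reader = sqlite3.connect(db_path)


def count_rows(connection):
    return connection.execute("SELECT COUNT(*) FROM task_status").fetchall()[0][0]


writer.execute("DELETE FROM task_status WHERE task_status_id = 101")
before_commit = count_rows(reader)  # the delete is not committed yet
writer.commit()
after_commit = count_rows(reader)  # the delete is now visible

writer.close()
reader.close()
shutil.rmtree(tmpdir)
```

If a delete appears to have no effect from another session, an uncommitted transaction like the one above is one possible cause.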

dlogan commented 8 years ago

I was just being cautious before and trying to understand the procedure. I just ran

delete from task_status where task_status_id in (203825)

successfully and without even hitting (re)submit, it now reports Running. Is that right that it submits after the delete?