Open braymp opened 8 years ago
From @LeeKamentsky on January 29, 2014 19:28
I can always try a certain number of times, maybe with pauses in between, but my gut feeling is that the problem might be local to that cluster node and perhaps it's not recoverable. This might be a case where it's simpler and more reliable to deal with the failure at a higher level (have BatchProfiler 2.0 run CellProfiler again on another node). Vebjorn, what do you think? Also, is it worth pinging IT to ask them to look at the logs?
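For reference, the "try a certain number of times, with pauses in between" option could look roughly like the sketch below. It is not the code in loadimages.py: the function name `wait_for_directory`, the default attempt count, and the pause length are all hypothetical, and it assumes that listing the directory is an adequate access test.

```python
import os
import time


def wait_for_directory(path, attempts=3, pause_seconds=5.0, sleep=time.sleep):
    """Return True once `path` can be listed, False if every attempt fails.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    for attempt in range(attempts):
        try:
            os.listdir(path)  # raises OSError if the directory is not accessible
            return True
        except OSError:
            # Pause only if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(pause_seconds)
    return False
```

If the failure really is local to one node, this only delays the error by roughly `(attempts - 1) * pause_seconds` seconds, which is the trade-off being weighed here.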
On Wed, Jan 29, 2014 at 2:17 PM, David Logan notifications@github.com wrote:
Batch # 4203 http://imagingweb.broadinstitute.org/batchprofiler/cgi-bin/FileUI/CellProfiler/BatchProfiler/ViewBatch.py?batch_id=4203 is running; however, 4% of its batches have failed so far, all with the same error (example below). They all seem to be temporary directory access failures. I presume temporary because I can manually cd to the supposedly offending directory just fine.
Instead of me resubmitting them manually, can we add a "try, wait, try again" loop in loadimages?
...
Tue Jan 28 19:22:52 2014: Image # 20306, module MeasureObjectIntensity # 7: 0.88 sec
Tue Jan 28 19:22:53 2014: Image # 20306, module MeasureImageIntensity # 17: 3.22 sec
Tue Jan 28 19:22:56 2014: Image # 20306, module ExportToDatabase # 19: 0.20 sec
Tue Jan 28 19:22:57 2014: Image # 20306, module CreateBatchFiles # 20: 0.00 sec
Error detected during run of module LoadData
Traceback (most recent call last):
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/pipeline.py", line 1747, in run_with_yield
    module.run(workspace)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loaddata.py", line 1065, in run
    image = workspace.image_set.get_image(image_name)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/measurements.py", line 1485, in get_image
    image = matching_providers[0].provide_image(self)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3138, in provide_image
    self.cache_file()
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3063, in cache_file
    raise IOError("Test for access to directory failed. Directory: %s" %path)
IOError: Test for access to directory failed. Directory: /cbnt/cbimageX/HCS/shanmeghan/combinatorialscreen-dcn/nocode/2013-03-28/38265
Tue Jan 28 19:22:57 2014: Image # 20307, module LoadData # 1: 0.00 sec
Exiting the JVM monitor thread
FreeFontPath: FPE "unix/:7100" refcount is 2, should be 1; fixing.
From @dlogan on January 29, 2014 19:37
Aha, you are likely right. I just checked and (so far) the node is always the same, node1625.
/imaging/analysis/2007_11_07_Hepatoxicity_SPARC/2013_03_27_combinatorialscreen/Main_pipeline_output/2014_01_28_CP2p1_RUN_STRMLD/txt_output]$ grep -B 1000 rror * | grep node
20301_to_20650.txt-Sender: LSF System <lsf@node1625>
20301_to_20650.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
24851_to_25200.txt-Sender: LSF System <lsf@node1625>
24851_to_25200.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51451_to_51800.txt-Sender: LSF System <lsf@node1625>
51451_to_51800.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51801_to_52150.txt-Sender: LSF System <lsf@node1625>
51801_to_52150.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
59151_to_59500.txt-Sender: LSF System <lsf@node1625>
59151_to_59500.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
60901_to_61250.txt-Sender: LSF System <lsf@node1625>
60901_to_61250.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
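The same per-node tally can be sketched in Python, in case the check needs to be repeated on other batches. This is only an illustration: `count_hosts` is a hypothetical name, and the pattern assumes the LSF report line always has the "Job was executed on host(s) <...>" form seen in the output above.

```python
import re
from collections import Counter

# Matches the LSF report line shown in the grep output above.
HOST_PATTERN = re.compile(r"Job was executed on host\(s\) <([^>]+)>")


def count_hosts(log_texts):
    """Count how many of the given log texts ran on each host."""
    hosts = Counter()
    for text in log_texts:
        match = HOST_PATTERN.search(text)
        if match:
            hosts[match.group(1)] += 1
    return hosts
```

Passing in only the text of the logs that contain the error gives a count of failures per node.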
From @dlogan on January 29, 2014 21:23
I emailed Help to get them to look at node1625 (Help Ticket # 409355).
From @ljosa on January 31, 2014 19:15
I think these kinds of errors are rarely temporary enough that it makes sense to sleep and retry; that only delays the inevitable. Better to fail fast and restart failed jobs from the top level.
Ideally, BatchProfiler should do that ASAP instead of waiting for a human to diagnose and trigger restarts, but I guess we don't want to rewrite BP right now…
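As a rough sketch of what "fail fast and restart from the top level" could mean, assuming a hypothetical `run_job(node)` callable that raises `IOError` when the job fails on that node (BatchProfiler has no such function as far as this thread shows; the names below are illustrative only):

```python
def run_with_resubmission(run_job, nodes, max_attempts=3):
    """Run the job on successive nodes until one succeeds.

    Returns (node_that_succeeded, list_of_earlier_failures).
    Raises RuntimeError if every node tried fails.
    """
    failures = []
    for node in nodes[:max_attempts]:
        try:
            run_job(node)  # hypothetical: submits the batch to this node
            return node, failures
        except IOError as error:
            # No sleeping and no retry on the same node: record and move on.
            failures.append((node, str(error)))
    raise RuntimeError("job failed on all nodes tried: %r" % (failures,))
```

The point of the design is that a bad node costs one failed attempt rather than a series of sleeps, and the failure record says which node to report.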
From @dlogan on January 29, 2014 19:17
Batch # 4203 http://imagingweb.broadinstitute.org/batchprofiler/cgi-bin/FileUI/CellProfiler/BatchProfiler/ViewBatch.py?batch_id=4203 is running; however, 4% of its batches have failed so far, all with the same error (example below). They all seem to be temporary directory access failures. I presume temporary because I can manually cd to the supposedly offending directory just fine.
Instead of me resubmitting them manually, can we add a "try, wait, try again" loop in loadimages?
...
Copied from original issue: CellProfiler/CellProfiler#1033