Make cluster directory access more robust

braymp commented 8 years ago

From @dlogan on January 29, 2014 19:17

Batch # 4203 http://imagingweb.broadinstitute.org/batchprofiler/cgi-bin/FileUI/CellProfiler/BatchProfiler/ViewBatch.py?batch_id=4203 is running; however, 4% of its batches have failed so far, all with the same error (example below). They all seem to be temporary directory access failures. I presume temporary because I can manually cd to the supposedly offending directory just fine.

Instead of me resubmitting them manually, can we add a "try, wait, try again" loop in loadimages?

...

Tue Jan 28 19:22:52 2014: Image # 20306, module MeasureObjectIntensity # 7: 0.88 sec
Tue Jan 28 19:22:53 2014: Image # 20306, module MeasureImageIntensity # 17: 3.22 sec
Tue Jan 28 19:22:56 2014: Image # 20306, module ExportToDatabase # 19: 0.20 sec
Tue Jan 28 19:22:57 2014: Image # 20306, module CreateBatchFiles # 20: 0.00 sec
Error detected during run of module LoadData
Traceback (most recent call last):
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/pipeline.py", line 1747, in run_with_yield
    module.run(workspace)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loaddata.py", line 1065, in run
    image = workspace.image_set.get_image(image_name)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/measurements.py", line 1485, in get_image
    image = matching_providers[0].provide_image(self)
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3138, in provide_image
    self.cache_file()
  File "/imaging/analysis/CPClusterSingle/CellProfiler-2.0/2013_12_03_15_53_04.toolbox.0.12.0/cellprofiler/modules/loadimages.py", line 3063, in cache_file
    raise IOError("Test for access to directory failed. Directory: %s" %path)
IOError: Test for access to directory failed. Directory: /cbnt/cbimageX/HCS/shanmeghan/combinatorialscreen-dcn/nocode/2013-03-28/38265
Tue Jan 28 19:22:57 2014: Image # 20307, module LoadData # 1: 0.00 sec
Exiting the JVM monitor thread
FreeFontPath: FPE "unix/:7100" refcount is 2, should be 1; fixing.
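
For reference, the "try, wait, try again" loop requested above could look roughly like the sketch below. It is not the code in loadimages.py: the helper name `wait_for_directory`, its parameters, and the use of `os.listdir` as the access test are all assumptions made for illustration, and it is written for Python 3.

```python
import os
import time


def wait_for_directory(path, attempts=3, delay=5.0, sleep=time.sleep):
    """Return True once `path` can be listed as a directory, else False.

    Hypothetical helper: tries up to `attempts` times and pauses `delay`
    seconds between tries. `sleep` is injectable so the pause can be
    replaced in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            os.listdir(path)  # raises OSError if the directory is unreachable
            return True
        except OSError:
            if attempt < attempts:
                sleep(delay)  # no pause after the final failed try
    return False
```

A caller would raise the existing error only after the helper gives up, for example `if not wait_for_directory(path): raise IOError("Test for access to directory failed. Directory: %s" % path)`. Keeping the attempt count small bounds how long a job can stall on a directory that never comes back.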

Copied from original issue: CellProfiler/CellProfiler#1033

braymp commented 8 years ago

From @LeeKamentsky on January 29, 2014 19:28

I can always try a certain number of times, maybe with pauses in between, but my gut feeling is that the problem might be local to that cluster node and perhaps it's not recoverable. This might be a case where it's simpler and more reliable to deal with the failure at a higher level (have BatchProfiler 2.0 run CellProfiler again on another node). Vebjorn, what do you think? Also, is it worth pinging IT to ask them to look at the logs?

braymp commented 8 years ago

From @dlogan on January 29, 2014 19:37

Aha, you are likely right. I just checked and (so far) the node is always the same, node1625.

/imaging/analysis/2007_11_07_Hepatoxicity_SPARC/2013_03_27_combinatorialscreen/Main_pipeline_output/2014_01_28_CP2p1_RUN_STRMLD/txt_output]$ grep -B 1000 rror * | grep node
20301_to_20650.txt-Sender: LSF System <lsf@node1625>
20301_to_20650.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
24851_to_25200.txt-Sender: LSF System <lsf@node1625>
24851_to_25200.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51451_to_51800.txt-Sender: LSF System <lsf@node1625>
51451_to_51800.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
51801_to_52150.txt-Sender: LSF System <lsf@node1625>
51801_to_52150.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
59151_to_59500.txt-Sender: LSF System <lsf@node1625>
59151_to_59500.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
60901_to_61250.txt-Sender: LSF System <lsf@node1625>
60901_to_61250.txt-Job was executed on host(s) <node1625>, in queue <bweek>, as user <imageweb> in cluster <cromwell>.
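
The same check can be scripted so the tally per node is explicit. This is only a sketch: it assumes each job log contains a "Job was executed on host(s) <...>" line in the format shown above, and the function name `failing_hosts` is made up for illustration. Unlike the grep, it scans each whole file for "rror" and does not limit itself to the 1000 lines before a match.

```python
import re
from collections import Counter

# Matches the LSF line shown in the listing above and captures the host name.
HOST_RE = re.compile(r"Job was executed on host\(s\) <([^>]+)>")


def failing_hosts(logs):
    """Count, per host, the logs that mention an error.

    `logs` maps a log file name to its full text.
    """
    counts = Counter()
    for text in logs.values():
        if "rror" not in text:  # same loose match as the grep above
            continue
        match = HOST_RE.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import glob

    logs = {}
    for name in glob.glob("*.txt"):
        with open(name, errors="replace") as handle:
            logs[name] = handle.read()
    for host, n in failing_hosts(logs).most_common():
        print(host, n)
```

Run from the txt_output directory, this prints one line per host with the number of failed job logs attributed to it.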

braymp commented 8 years ago

From @dlogan on January 29, 2014 21:23

I emailed Help to get them to look at node1625 (Help Ticket # 409355).

braymp commented 8 years ago

From @ljosa on January 31, 2014 19:15

I think these kinds of errors are rarely temporary enough that it makes sense to sleep and retry; that only delays the inevitable. Better to fail fast and restart failed jobs from the top level.

Ideally, BatchProfiler should do that ASAP instead of waiting for a human to diagnose and trigger restarts, but I guess we don't want to rewrite BP right now…
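
As a rough illustration of restarting at the top level, the sketch below lets each job fail fast and re-runs only the failures. It is not BatchProfiler code: `run_with_restarts` and the `run_job` callable are hypothetical, and how a job gets submitted (or steered away from a bad node) is left to whatever `run_job` does.

```python
def run_with_restarts(jobs, run_job, max_attempts=3):
    """Run every job, re-running failures up to `max_attempts` times in total.

    `jobs` is an iterable of job identifiers (for example image-set ranges).
    `run_job(job)` is a hypothetical callable that runs one job and returns
    True on success and False on failure.
    Returns the jobs that still failed after the last attempt.
    """
    pending = list(jobs)
    for _ in range(max_attempts):
        if not pending:
            break
        # Keep only the jobs that failed in this round for the next round.
        pending = [job for job in pending if not run_job(job)]
    return pending
```

Anything returned at the end still needs a human, which matches the point above: automatic restarts handle the node-local failures, and persistent ones surface quickly.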

CellProfiler / BatchProfiler

Make cluster directory access more robust #24