Hi Constantin,
I've been running multicut well on a cluster node with 1.5 TB of RAM, but it hits a segmentation fault, presumably from running out of memory, on arrays larger than ~5k x 5k x 15. How do I implement the large-volume processing? I believe that is what the Luigi package is for, but I can't find a concise example of its usage in the repository.
For reference, this is the script I'm running (the one you were kind enough to send me):
Thanks again, Matthew
Yes, this script does not scale well to large volumes. Instead you will need to use the functionality from this repository. You can find an example with some explanations here: https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py
Note that there are some important prerequisites for using this.
Also, does your cluster run any scheduling system? For now I support slurm and lsf, but it is straightforward to extend this to other schedulers by implementing a class like https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L374.
Yes, we use slurm.
I do have the cluster_env conda environment built, but it wasn't finding the cluster_tools module so I added this:
export PYTHONPATH="/home/mmadany/miniconda3/envs/cluster_env/bin:/home/mmadany/Multicut/cluster_tools-master:/home/mmadany/Multicut/cluster_tools-master/cluster_tools"
I have configured z5 and converted to n5 files. When I try to run that example script, I get this error:
import os
import json
import luigi
from cluster_tools import MulticutSegmentationWorkflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/__init__.py", line 1, in <module>
    from .workflows import MulticutSegmentationWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/workflows.py", line 5, in <module>
    from .watershed import WatershedWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/__init__.py", line 1, in <module>
    from .watershed_workflow import WatershedWorkflow
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed_workflow.py", line 4, in <module>
    from . import watershed as watershed_tasks
  File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed.py", line 11, in <module>
    from nifty.filters import nonMaximumDistanceSuppression
ImportError: cannot import name 'nonMaximumDistanceSuppression' from 'nifty.filters' (/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/nifty/filters/__init__.py)
Yes, sorry, I just implemented nonMaximumDistanceSuppression and it's not in the conda package yet. Please check out the latest commit 03ec3b8 and try again. I added a check to skip nonMaximumDistanceSuppression if it's not available.
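For reference, the check essentially amounts to an optional-import guard, roughly like the sketch below (the actual code in the commit may differ slightly):

```python
# rough sketch of the optional-import guard; the actual code in commit 03ec3b8 may differ
try:
    from nifty.filters import nonMaximumDistanceSuppression
except ImportError:
    nonMaximumDistanceSuppression = None  # the suppression step is skipped if unavailable
```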
Ok, that runs, and I see it's doing the job configuration within the program. This is what I'm getting:
(cluster_env) [mmadany@comet-ln2 cluster_tools-master]$ python ~/Multicut/runluigi.py
DEBUG: Checking if MulticutSegmentationWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, node_labels_key=node_labels, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, mask_path=, mask_key=, rf_path=, node_label_dict={}, max_jobs_merge=1, skip_ws=True, agglomerate_ws=False, two_pass_ws=False, sanity_checks=False, max_jobs_multicut=1, n_scales=1) is complete
DEBUG: Checking if WriteSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, dependency=MulticutWorkflow, identifier=multicut, offset_path=) is complete
INFO: Informed scheduler that task MulticutSegmentationWorkflow_False_config_mc_DummyTask_6d798a14ef has status PENDING
DEBUG: Checking if MulticutWorkflow(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, target=slurm, dependency=ProblemWorkflow, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, n_scales=1, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels) is complete
INFO: Informed scheduler that task WriteSlurm_node_labelsoasis_scratch_c_config_mc_4d42f4969f has status PENDING
DEBUG: Checking if SolveGlobalSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, scale=1, dependency=ReduceProblemSlurm) is complete
INFO: Informed scheduler that task MulticutWorkflow_node_labelsoasis_scratch_c_config_mc_e52655bb6f has status PENDING
DEBUG: Checking if ReduceProblemSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=SolveSubproblemsSlurm) is complete
INFO: Informed scheduler that task SolveGlobalSlurm_node_labels__oasis_scratchcconfig_mc_8b8648e259 has status PENDING
DEBUG: Checking if SolveSubproblemsSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=ProblemWorkflow) is complete
INFO: Informed scheduler that task ReduceProblemSlurm_config_mc_SolveSubproblems_1_182aa76377 has status PENDING
DEBUG: Checking if ProblemWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, rf_path=, node_label_dict={}, max_jobs_merge=1, compute_costs=True, sanity_checks=False) is complete
INFO: Informed scheduler that task SolveSubproblemsSlurm___config_mc_ProblemWorkflow_1_a1448fd645 has status PENDING
DEBUG: Checking if EdgeCostsWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=EdgeFeaturesWorkflow, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, node_label_dict={}, rf_path=) is complete
INFO: Informed scheduler that task ProblemWorkflowTrueconfig_mc_DummyTask_3f92ce107e has status PENDING
DEBUG: Checking if ProbsToCostsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, input_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, dependency=EdgeFeaturesWorkflow, node_label_dict={}) is complete
INFO: Informed scheduler that task EdgeCostsWorkflow_config_mc_EdgeFeaturesWork_features_2d838ae4dc has status PENDING
DEBUG: Checking if EdgeFeaturesWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=GraphWorkflow, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, max_jobs_merge=1) is complete
INFO: Informed scheduler that task ProbsToCostsSlurmconfig_mc_EdgeFeaturesWork_features_682c0950ab has status PENDING
DEBUG: Checking if MergeEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, dependency=BlockEdgeFeaturesSlurm) is complete
INFO: Informed scheduler that task EdgeFeaturesWorkflow_config_mc_GraphWorkflow_s0_graph_f1bc78dfbd has status PENDING
DEBUG: Checking if BlockEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=GraphWorkflow) is complete
INFO: Informed scheduler that task MergeEdgeFeaturesSlurm_config_mc_BlockEdgeFeature_s0_graph_34ddff7acc has status PENDING
DEBUG: Checking if GraphWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/graph, n_scales=1) is complete
INFO: Informed scheduler that task BlockEdgeFeaturesSlurmconfig_mc_GraphWorkflow__oasis_scratch_c_8bd529565b has status PENDING
DEBUG: Checking if MapEdgeIdsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, input_key=s0/graph, scale=0, dependency=MergeSubGraphsSlurm) is complete
INFO: Informed scheduler that task GraphWorkflowconfig_mc_DummyTaskoasis_scratch_c_cb70462974 has status PENDING
DEBUG: Checking if MergeSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, output_key=s0/graph, merge_complete_graph=True, dependency=InitialSubGraphsSlurm) is complete
INFO: Informed scheduler that task MapEdgeIdsSlurm_config_mc_MergeSubGraphsSloasis_scratch_c_6c607199dc has status PENDING
DEBUG: Checking if InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask) is complete
INFO: Informed scheduler that task MergeSubGraphsSlurm_config_mc_InitialSubGraphsoasis_scratch_c_8ef59ea786 has status PENDING
DEBUG: Checking if DummyTask() is complete
INFO: Informed scheduler that task InitialSubGraphsSlurm_config_mc_DummyTaskoasis_scratch_c_f2de7aaf60 has status PENDING
INFO: Informed scheduler that task DummyTask99914b932b has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 16
INFO: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) running InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
sbatch: error: bank_limit plugin: expired user, can't submit job
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
ERROR: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) failed InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
Traceback (most recent call last):
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 93, in run
    raise e
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 79, in run
    self.run_impl()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/graph/initial_sub_graphs.py", line 76, in run_impl
    self.submit_jobs(n_jobs)
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 443, in submit_jobs
    outp = check_output(command).decode().rstrip()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['sbatch', '-o', './tmp_mc_A/logs/initial_sub_graphs_0.log', '-e', './tmp_mc_A/error_logs/initial_sub_graphs_0.err', '-J', 'initial_sub_graphs_0', './tmp_mc_A/slurm_initial_sub_graphs.sh', '0']' returned non-zero exit status 1.
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task InitialSubGraphsSlurm_config_mc_DummyTaskoasis_scratch_c_f2de7aaf60 has status FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 16 pending tasks possibly being run by other workers
DEBUG: There are 16 pending tasks unique to this worker
DEBUG: There are 16 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) was stopped. Shutting down Keep-Alive thread
INFO: ===== Luigi Execution Summary =====
Scheduled 17 tasks of which:
- 1 complete ones were encountered:
- 1 DummyTask()
- 1 failed:
- 1 InitialSubGraphsSlurm(...)
- 15 were left pending, among these:
- 15 had failed dependencies:
- 1 BlockEdgeFeaturesSlurm(...)
- 1 EdgeCostsWorkflow(...)
- 1 EdgeFeaturesWorkflow(...)
- 1 GraphWorkflow(...)
- 1 MapEdgeIdsSlurm(...) ...
This progress looks :( because there were failed tasks
===== Luigi Execution Summary =====
Looks like this is where the cluster configuration comes in. I need to change my group id and such. Where do I change that and other sbatch variables?
You can update the slurm config here: https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L69. Just add 'groupname': YOUR_GROUP_NAME.
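For example, following the pattern in run_mc.py (just a sketch; the shebang, block shape, group name and config file name are placeholders and may need adjusting for your installation):

```python
import json
import os
from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'
os.makedirs(config_dir, exist_ok=True)

# default configs for all tasks of the workflow
configs = MulticutSegmentationWorkflow.get_config()

# update the global config with the slurm account / sbatch settings
global_config = configs['global']
global_config.update({'shebang': '#! /home/mmadany/miniconda3/envs/cluster_env/bin/python',
                      'block_shape': [50, 512, 512],
                      'groupname': 'YOUR_GROUP_NAME'})
with open(os.path.join(config_dir, 'global.config'), 'w') as f:
    json.dump(global_config, f)
```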
Also, for debugging, it might be useful to run the command that fails directly and see the error message:
sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh 0
Ok this is what I'm getting now:
> Traceback (most recent call last):
> File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
> initial_sub_graphs(job_id, path)
> File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
> ignore_label)
> File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
> increaseRoi=True)
> RuntimeError: Request has wrong type
>
That came from each of the 16 sbatch jobs. It looks like my data type might be off? I'm using the .n5 files, but here's what the .h5 file's data looks like when I pull a snippet using h5ls -d:
Boundary Predictions, where 1 is the background and 0 are the boundaries:
(0,58,2742) 0.890196078431372, 0.866666666666667, 0.815686274509804, 0.717647058823529, 0.725490196078431, 0.592156862745098, 0.392156862745098, 0.192156862745098, 0.0941176470588235, 0.0431372549019608, 0.0235294117647059, 0.0196078431372549, 0.0156862745098039, (0,58,2755) 0.0235294117647059, 0.0392156862745098, 0.0901960784313725, 0.203921568627451, 0.407843137254902, 0.592156862745098, 0.756862745098039, 0.882352941176471, 0.945098039215686, 0.980392156862745, 0.992156862745098, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, (0,58,2779) 1, 0.996078431372549, 0.996078431372549, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Watershed file, uint32 values in sequence with no holes:
(0,3293,1530) 23660, 23660, 23660, 23660, 23660, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23368, (0,3293,1568) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, (0,3293,1606) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, (0,3293,1644) 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, (0,3293,1682) 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, (0,3293,1720) 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653
The watershed needs to be stored in uint64. Sorry for the late reply.
Also, to avoid issues you might get in the feature computation: boundary maps need to be stored either in uint8 or in float32
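If you need to convert existing datasets, something along these lines should work (a sketch with placeholder paths and keys; it loads the full volumes into memory, so for very large data you would want to do this chunk-wise):

```python
import z5py

# placeholder paths / keys: adjust to your layout
f = z5py.File('/path/to/data.n5')

# cast the watershed labels to uint64 and store them under a new key
ws = f['watershed'][:]
ds = f.create_dataset('watershed_uint64', shape=ws.shape, dtype='uint64',
                      chunks=(25, 256, 256), compression='gzip')
ds[:] = ws.astype('uint64')

# cast the boundary map to float32 (uint8 would work as well)
bmap = f['boundaries'][:]
ds = f.create_dataset('boundaries_float32', shape=bmap.shape, dtype='float32',
                      chunks=(25, 256, 256), compression='gzip')
ds[:] = bmap.astype('float32')
```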
Ok, I made sure my data is uint8 for the boundaries and uint64 for the watershed, but I'm still getting the same error:
sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh
cat ./tmp_mc_A/logs/initial_sub_graphs_0.log
Mytype: d your type: m
2019-04-24 21:23:27.502097: start processing job 0
2019-04-24 21:23:27.502127: reading config from ./tmp_mc_A/initial_sub_graphs_job_0.config
2019-04-24 21:23:27.515858: start processing block 0
cat ./tmp_mc_A/error_logs/initial_sub_graphs_0.err
Traceback (most recent call last):
  File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
    initial_sub_graphs(job_id, path)
  File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
    ignore_label)
  File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
    increaseRoi=True)
RuntimeError: Request has wrong type
Looks like this error message has occurred in your z5 repo:
Merged #52, the issue should be fixed.
Originally posted by @constantinpape in https://github.com/constantinpape/z5/issues/50#issuecomment-388982862
Yes, this error message comes from z5 and indicates that some datatypes do not agree. Are you sure both the boundaries and the superpixels are stored correctly? Can you open them with z5 from python?
import z5py
f = z5py.File('/path/to/data.n5')
ds = f['path/in/file']
print(ds.dtype)
If you do this, the dtype should be uint8 (or float32) for the boundaries and uint64 for the superpixels.
Ok, I got the workflow up and running on my data, and it also worked end-to-end on the sample data. I'm just getting memory errors from my merge_graphs workers. I'd like to change the partition to the high-memory compute nodes, where we have 64 cores and 1.45 TB of RAM. I tried this:
global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'})
but it seems the workers still get sent to the regular nodes. How can I see/edit exactly where the sbatch jobs are being submitted? I think I could also do '--mem=1G --ntasks=1' and it would split the jobs up across the resources on the node.
Ok, I got the workflow up and running on my data, and it also worked end-to-end on the sample data.
Glad to hear it!
I tried this:
global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'})
The global config does not support arbitrary arguments, but just the ones listed here: https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L217-L224. Note that I added the partition option just now.
Also, you need to specify the memory limit for the individual tasks, by updating the mem_limit value in the task config. See also https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L92.
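For example, for the task that runs out of memory (a sketch following run_mc.py; the config file name and the unit of mem_limit, which I believe is GB, may need checking against your version):

```python
import json
import os
from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'

# per-task config: raise memory and time limits for merge_edge_features
configs = MulticutSegmentationWorkflow.get_config()
task_config = configs['merge_edge_features']
task_config.update({'mem_limit': 256,     # memory limit for the job
                    'time_limit': 180})   # time limit for the job
with open(os.path.join(config_dir, 'merge_edge_features.config'), 'w') as f:
    json.dump(task_config, f)
```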
Hope this helps.
Ok, thanks, that works great.
The furthest I've been able to get is solve_global, but that ran for over 48 hours, which is my job time limit. I've been experimenting with the parallel block size for my 7k x 5k x 400 xyz test volume (400 x 5k x 7k in the .n5 file's axis order).
Right now I'm trying block_shape = [80, 1024, 1024], but I was wondering if you could recommend a value here? I'm allocating 180 GB of RAM to the parallel workers and can allocate up to 1.45 TB to the single-worker steps; I've done that for 'solve_global', 'solve_subproblems', and 'merge_edge_features' so far because they were giving 'out of memory' and segmentation fault errors.
Right now I'm trying block_shape = [80, 1024, 1024]
That sounds reasonable.
How many nodes are in the graph (i.e. how many supervoxel ids are there)?
Usually the solve_global step should be quite fast if the problem was reduced by solving the subproblems.
Have you tried running everything on a smaller cutout of the data (say 200 x 1024 x 1024) and checked the results?
One potential issue could be that your boundary maps follow a different convention than what I expect: I assume boundaries to correspond to high values (i.e. 1 means maximal boundary probability for a pixel). If your boundary maps have the opposite convention, you can set invert_inputs to True for probs_to_costs, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/costs/probs_to_costs.py#L51.
If you use the correct boundary convention and the cutout results look decent, there are two options to speed up the final multicut:
1. Set the agglomerator to greedy-additive in the config of solve_global, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/multicut/solve_global.py#L43.
2. Run the workflow with n_scales > 1, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/workflows.py#L207.
These two options can also be combined. Note that both can reduce the quality of the resulting segmentation a bit, but from my experience the effect should not be very significant.
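To make that concrete, here is a rough sketch of how both options could be wired up (paths are placeholders; the constructor arguments are copied from the log output above and the config keys from solve_global.py, so double-check them against your run script):

```python
import json
import os
import luigi
from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'

# option 1: use the greedy-additive agglomerator for the final solve
configs = MulticutSegmentationWorkflow.get_config()
solve_config = configs['solve_global']
solve_config.update({'agglomerator': 'greedy-additive'})
with open(os.path.join(config_dir, 'solve_global.config'), 'w') as f:
    json.dump(solve_config, f)

# option 2: reduce the problem over more than one scale before the final solve
task = MulticutSegmentationWorkflow(tmp_folder='./tmp_mc_A', config_dir=config_dir,
                                    max_jobs=16, target='slurm',
                                    input_path='/path/to/boundaries.n5', input_key='dataset1',
                                    ws_path='/path/to/watershed.n5', ws_key='dataset1',
                                    problem_path='/path/to/problem.n5',
                                    node_labels_key='node_labels',
                                    output_path='/path/to/output.n5',
                                    output_key='segmentation/multicut',
                                    n_scales=2)
luigi.build([task], local_scheduler=True)
```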
I have 22 million superpixels
High boundary chance is 255, uint8. What's the difference between 'time_limit' and 'time_limit_solver'? This is a cross-section of my current volume (lots of myelin, unfortunately):
After I get that to work and understand what the time / parallelization values should be, I'd like to try volumes that are more around the size of 15k x 15k x 1k, like this: Does that look like it could handle a greedy solver with higher n_scales?
I have 22 million superpixels
That should be fine; I have solved problems with about 2 orders of magnitude more superpixels with this pipeline.
High boundary chance is 255, uint8
That's good, you don't need to change invert_inputs then.
What's the difference between 'time_limit' and 'time_limit_solver'?
time_limit is the maximum time a job will run; it is passed as the value for the -t parameter to slurm.
time_limit_solver is a time limit that is passed to the actual multicut solver. I forgot to mention this parameter earlier; setting time_limit_solver might actually fix your problem. You should set it to ~4 hours less than time_limit. (time_limit_solver is soft, which means that the solver will not abruptly stop after the time has passed, but will only check for it after completing an internal iteration. Depending on the problem size, the iterations can take quite a while, which is why it's safer to give some leeway compared to time_limit.)
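For example, in the solve_global config (a sketch; I believe time_limit is given in minutes and time_limit_solver in seconds, see below):

```python
import json
from cluster_tools import MulticutSegmentationWorkflow

# sketch: 48 h slurm limit, solver stopped roughly 4 h earlier
solve_config = MulticutSegmentationWorkflow.get_config()['solve_global']
solve_config.update({'time_limit': 48 * 60,               # minutes, passed to sbatch -t
                     'time_limit_solver': 44 * 60 * 60})  # seconds, passed to the solver
with open('./config_mc/solve_global.config', 'w') as f:
    json.dump(solve_config, f)
```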
Does that look like it could handle a greedy solver with higher n_scales?
Yes, this looks feasible.
Ok, looks like time_limit is in minutes (or slurm/sbatch command format) and time_limit_solver is in seconds, correct?
Yes, that's correct.
Ok, the full process is working and it looks great. I'll email you a video of what it looks like. Thanks again!
You're very welcome, and thanks for your patience. I am looking forward to seeing the results :).