constantinpape / cluster_tools

Distributed segmentation for bio-image-analysis
MIT License

How to run large volumes #8

Closed MatthewBM closed 5 years ago

MatthewBM commented 5 years ago

Hi Constantin,

I've been running multicut successfully on a cluster node with 1.5 TB of RAM, but it hits a segmentation fault, presumably from running out of RAM, on arrays larger than ~5k x 5k x 15.

How do I implement the large-volume processing? I believe that is what the Luigi package is for, but I can't find a concise usage example in the repository.

For reference this is the script I'm running (that you were kind enough to send me):

import numpy as np
import vigra
import nifty
import nifty.graph.rag as nrag
import nifty.graph.opt.multicut as nmc

def probs_to_weights(edge_weights, edge_sizes):
    p_min = 0.001
    p_max = 1. - p_min
    edge_weights = (p_max - p_min) * edge_weights + p_min
    # probabilities to edge_weights
    edge_weights = np.log((1. - edge_weights) / edge_weights)
    edge_weights *= (edge_sizes / edge_sizes.max())
    return edge_weights

def normalize(input_):
    input_ = input_.astype('float32')
    input_ -= input_.min()
    input_ /= input_.max()
    return input_

def segment_mc(pmap_path, ws_path):
    print("Loading data ...")
    # need to invert to have multicut boundary conventions
    pmap = vigra.impex.readVolume(pmap_path).view(np.ndarray).squeeze().T
    # need to normalize the probability map to range 0, 1
    pmap = normalize(pmap)
    #pmap = 1. - pmap
    ws = vigra.impex.readVolume(ws_path).view(np.ndarray).squeeze().T.astype('uint32')
    # relabel the over-segmentation consecutively
    ws, max_id, _ = vigra.analysis.relabelConsecutive(ws, start_label=0, keep_zeros=False)
    assert pmap.shape == ws.shape
    print("Building graph ...")
    # build region adjacency graph
    rag = nrag.gridRag(ws, numberOfLabels=max_id+1)
    print("Extracting features ...")
    # extract features over the superpixel edges
    feats = nrag.accumulateEdgeMeanAndLength(rag, pmap)
    edge_weights, edge_sizes = feats[:, 0], feats[:, 1]
    # convert to multicut weights
    edge_weights = probs_to_weights(edge_weights, edge_sizes)
    print("Solving multicut ...")
    graph = nifty.graph.undirectedGraph(rag.numberOfNodes)
    graph.insertEdges(rag.uvIds())
    obj = nmc.multicutObjective(graph, edge_weights)
    # we use the most greedy solver to speed things up
    #solver = obj.kernighanLinFactory(warmStartGreedy=True).create(obj)
    solver = obj.greedyAdditiveFactory().create(obj)
    node_labels = solver.optimize()
    print("Map multicut solution to pixels")
    seg = nrag.projectScalarNodeDataToPixels(rag, node_labels)
    return seg.astype('uint32')

def segment_and_save(pmap_path, ws_path, out_path):
    seg = segment_mc(pmap_path, ws_path)
    vigra.impex.writeVolume(seg, out_path, '', compression='DEFLATE')

if __name__ == '__main__':
    # NOTE I can't read these tiffs with imageio, so I needed to fall back to vigra
    pmap_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/lhb_sdscup/fxmemcrop/slice0001.png'
    ws_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/lhb_sdscup/sig32_v2/slice0001.tif'
    out_path = '/oasis/scratch/comet/mmadany/temp_project/MultiCut/multiout/new_solver_nvcorr_n24sli15_segmentation.tif'
    # segment_mc(pmap_path, ws_path)
    segment_and_save(pmap_path, ws_path, out_path)

Thanks again, Matthew

constantinpape commented 5 years ago

I've been running multicut well on a cluster node with 1.5TB ram but it has a segmentation fault, presumably running out of RAM, on arrays larger than ~ 5k x 5k x 15.

Yes, this script does not scale well to large volumes. Instead you will need to use the functionality from this repository. You can find an example with some explanations here: https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py

Note that there are some important prerequisites to use this:

Also, does your cluster run any scheduling system? For now, I support slurm and lsf, but it is straightforward to extend this to other schedulers by implementing a class like https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L374.
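
For orientation, the example script essentially boils down to something like the following (a minimal sketch of the idea, not a drop-in replacement for run_mc.py; the get_config helper and the 'global.config' filename follow the pattern in the example, and the shebang, paths, dataset keys and job numbers are placeholders you need to adapt):

import json
import os

import luigi
from cluster_tools import MulticutSegmentationWorkflow

tmp_folder = './tmp_mc'      # scratch space for job scripts, logs and intermediate results
config_dir = './config_mc'   # directory holding the json configs for the individual tasks
os.makedirs(config_dir, exist_ok=True)

# write the global config: shebang of the python env with cluster_tools installed,
# the block shape used for parallelization, etc.
global_config = MulticutSegmentationWorkflow.get_config()['global']
global_config.update({'shebang': '#! /path/to/conda/envs/cluster_env/bin/python',
                      'block_shape': [50, 512, 512]})
with open(os.path.join(config_dir, 'global.config'), 'w') as f:
    json.dump(global_config, f)

# the workflow reads boundary maps and a watershed over-segmentation from chunked
# n5/hdf5 datasets and writes the multicut segmentation to the output dataset
task = MulticutSegmentationWorkflow(
    tmp_folder=tmp_folder, config_dir=config_dir, max_jobs=16, target='slurm',
    input_path='/path/to/boundaries.n5', input_key='data',
    ws_path='/path/to/watershed.n5', ws_key='data',
    problem_path=os.path.join(tmp_folder, 'problem.n5'),
    node_labels_key='node_labels',
    output_path='/path/to/output.n5', output_key='segmentation/multicut')
luigi.build([task], local_scheduler=True)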

MatthewBM commented 5 years ago

Yes we use slurm.

I do have the cluster_env conda environment built, but it wasn't finding the cluster_tools module so I added this: export PYTHONPATH="/home/mmadany/miniconda3/envs/cluster_env/bin:/home/mmadany/Multicut/cluster_tools-master:/home/mmadany/Multicut/cluster_tools-master/cluster_tools"

I have configured z5 and converted to n5 files. When I try to run that example script, I get this error:

import os
import json
import luigi
from cluster_tools import MulticutSegmentationWorkflow

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/__init__.py", line 1, in <module>
from .workflows import MulticutSegmentationWorkflow
File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/workflows.py", line 5, in <module>
from .watershed import WatershedWorkflow
File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/__init__.py", line 1, in <module>
from .watershed_workflow import WatershedWorkflow
File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed_workflow.py", line 4, in <module>
from . import watershed as watershed_tasks
File "/home/mmadany/Multicut/cluster_tools-master/cluster_tools/watershed/watershed.py", line 11, in <module>
from nifty.filters import nonMaximumDistanceSuppression
ImportError: cannot import name 'nonMaximumDistanceSuppression' from 'nifty.filters' (/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/nifty/filters/__init__.py)
constantinpape commented 5 years ago

Yes, sorry, I just implemented nonMaximumDistanceSuppression and it's not in the conda package yet. Please check out the latest commit 03ec3b8 and try again. I added a check to skip nonMaximumDistanceSuppression if it's not available.
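
(For reference, the check is just a guarded import along these lines; this is a sketch of the idea, not the exact code in watershed.py.)

try:
    # only available in very recent nifty versions, see the commit above
    from nifty.filters import nonMaximumDistanceSuppression
    HAVE_NMS = True
except ImportError:
    nonMaximumDistanceSuppression = None
    HAVE_NMS = False

# the watershed task then only applies the distance-based non-maximum
# suppression to the seeds if HAVE_NMS is True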

MatthewBM commented 5 years ago

OK, that runs, and I see it does the job configuration within the program. This is what I'm getting:

(cluster_env) [mmadany@comet-ln2 cluster_tools-master]$ python ~/Multicut/runluigi.py
DEBUG: Checking if MulticutSegmentationWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, node_labels_key=node_labels, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, mask_path=, mask_key=, rf_path=, node_label_dict={}, max_jobs_merge=1, skip_ws=True, agglomerate_ws=False, two_pass_ws=False, sanity_checks=False, max_jobs_multicut=1, n_scales=1) is complete
DEBUG: Checking if WriteSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, output_key=segmentation/multicut, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, dependency=MulticutWorkflow, identifier=multicut, offset_path=) is complete
INFO: Informed scheduler that task MulticutSegmentationWorkflow_False_config_mc_DummyTask_6d798a14ef has status PENDING
DEBUG: Checking if MulticutWorkflow(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, target=slurm, dependency=ProblemWorkflow, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, n_scales=1, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels) is complete
INFO: Informed scheduler that task WriteSlurm_node_labelsoasis_scratch_c_config_mc_4d42f4969f has status PENDING
DEBUG: Checking if SolveGlobalSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, assignment_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multi_luigi_out.h5, assignment_key=node_labels, scale=1, dependency=ReduceProblemSlurm) is complete
INFO: Informed scheduler that task MulticutWorkflow_node_labelsoasis_scratch_c_config_mc_e52655bb6f has status PENDING
DEBUG: Checking if ReduceProblemSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=SolveSubproblemsSlurm) is complete
INFO: Informed scheduler that task SolveGlobalSlurm_node_labels__oasis_scratchcconfig_mc_8b8648e259 has status PENDING
DEBUG: Checking if SolveSubproblemsSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, dependency=ProblemWorkflow) is complete
INFO: Informed scheduler that task ReduceProblemSlurm_config_mc_SolveSubproblems_1_182aa76377 has status PENDING
DEBUG: Checking if ProblemWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, ws_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, ws_key=dataset1, problem_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, rf_path=, node_label_dict={}, max_jobs_merge=1, compute_costs=True, sanity_checks=False) is complete
INFO: Informed scheduler that task SolveSubproblemsSlurm___config_mc_ProblemWorkflow_1_a1448fd645 has status PENDING
DEBUG: Checking if EdgeCostsWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=EdgeFeaturesWorkflow, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, node_label_dict={}, rf_path=) is complete
INFO: Informed scheduler that task ProblemWorkflowTrueconfig_mc_DummyTask_3f92ce107e has status PENDING
DEBUG: Checking if ProbsToCostsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, input_key=features, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/costs, features_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, features_key=features, dependency=EdgeFeaturesWorkflow, node_label_dict={}) is complete
INFO: Informed scheduler that task EdgeCostsWorkflow_config_mc_EdgeFeaturesWork_features_2d838ae4dc has status PENDING
DEBUG: Checking if EdgeFeaturesWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=GraphWorkflow, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, max_jobsmerge=1) is complete
INFO: Informed scheduler that task ProbsToCostsSlurmconfig_mc_EdgeFeaturesWork_features_682c0950ab has status PENDING
DEBUG: Checking if MergeEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=1, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, graph_key=s0/graph, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=features, dependency=BlockEdgeFeaturesSlurm) is complete
INFO: Informed scheduler that task EdgeFeaturesWorkflow_config_mc_GraphWorkflow_s0_graph_f1bc78dfbd has status PENDING
DEBUG: Checking if BlockEdgeFeaturesSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/fxmemf.n5, input_key=dataset1, labels_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, labels_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=GraphWorkflow) is complete
INFO: Informed scheduler that task MergeEdgeFeaturesSlurm_config_mc_BlockEdgeFeature_s0_graph_34ddff7acc has status PENDING
DEBUG: Checking if GraphWorkflow(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, target=slurm, dependency=DummyTask, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, output_key=s0/graph, nscales=1) is complete
INFO: Informed scheduler that task BlockEdgeFeaturesSlurmconfig_mc_GraphWorkflow__oasis_scratch_c_8bd529565b has status PENDING
DEBUG: Checking if MapEdgeIdsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, inputkey=s0/graph, scale=0, dependency=MergeSubGraphsSlurm) is complete
INFO: Informed scheduler that task GraphWorkflowconfig_mc_DummyTaskoasis_scratch_c_cb70462974 has status PENDING
DEBUG: Checking if MergeSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, scale=0, output_key=s0/graph, merge_complete_graph=True, dependency=InitialSubGraphsSlurm) is complete
INFO: Informed scheduler that task MapEdgeIdsSlurm_config_mc_MergeSubGraphsSloasis_scratch_c_6c607199dc has status PENDING
DEBUG: Checking if InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask) is complete
INFO: Informed scheduler that task MergeSubGraphsSlurm_config_mc_InitialSubGraphsoasis_scratch_c_8ef59ea786 has status PENDING
DEBUG: Checking if DummyTask() is complete
INFO: Informed scheduler that task InitialSubGraphsSlurm_config_mc_DummyTaskoasis_scratch_c_f2de7aaf60 has status PENDING
INFO: Informed scheduler that task DummyTask99914b932b has status DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 16
INFO: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) running InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
sbatch: error: bank_limit plugin: expired user, can't submit job
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
ERROR: [pid 21179] Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) failed InitialSubGraphsSlurm(tmp_folder=./tmp_mc_A, max_jobs=16, config_dir=./config_mc, input_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/sigv.n5, input_key=dataset1, graph_path=/oasis/scratch/comet/mmadany/temp_project/LHB_FullAuto/multiluigi_temp.n5, dependency=DummyTask)
Traceback (most recent call last):
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 93, in run
    raise e
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 79, in run
    self.run_impl()
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/graph/initial_sub_graphs.py", line 76, in run_impl
    self.submit_jobs(n_jobs)
  File "/home/mmadany/Multicut/cluster_tools2/cluster_tools-master/cluster_tools/cluster_tasks.py", line 443, in submit_jobs
    outp = check_output(command).decode().rstrip()
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 376, in check_output
    **kwargs).stdout
  File "/home/mmadany/miniconda3/envs/cluster_env/lib/python3.7/subprocess.py", line 468, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['sbatch', '-o', './tmp_mc_A/logs/initial_sub_graphs_0.log', '-e', './tmp_mc_A/error_logs/initial_sub_graphs_0.err', '-J', 'initial_sub_graphs_0', './tmp_mc_A/slurm_initial_sub_graphs.sh', '0']' returned non-zero exit status 1.
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task InitialSubGraphsSlurm_config_mc_DummyTaskoasis_scratch_c_f2de7aaf60 has status FAILED
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
DEBUG: There are 16 pending tasks possibly being run by other workers
DEBUG: There are 16 pending tasks unique to this worker
DEBUG: There are 16 pending tasks last scheduled by this worker
INFO: Worker Worker(salt=544906811, workers=1, host=comet-ln2.sdsc.edu, username=mmadany, pid=21179) was stopped. Shutting down Keep-Alive thread
INFO: ===== Luigi Execution Summary =====

Scheduled 17 tasks of which:

  • 1 complete ones were encountered:
    • 1 DummyTask()
  • 1 failed:
    • 1 InitialSubGraphsSlurm(...)
  • 15 were left pending, among these:
    • 15 had failed dependencies:
      • 1 BlockEdgeFeaturesSlurm(...)
      • 1 EdgeCostsWorkflow(...)
      • 1 EdgeFeaturesWorkflow(...)
      • 1 GraphWorkflow(...)
      • 1 MapEdgeIdsSlurm(...) ...

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

Looks like this is where the cluster configuration comes in. I need to change my group ID and such. Where do I change that and other sbatch variables?

constantinpape commented 5 years ago

You can update the slurm config here: https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L69 Just add 'groupname': YOUR_GROUP_NAME.

Also, for debugging, it might be useful to run the failing command directly and see the error message: sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh 0
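
Concretely, the group name just goes into the same global config dict that already holds the shebang and block shape, which is then written to the config directory; a short sketch (assuming the 'global.config' filename from the example, with a placeholder group name):

import json
import os

from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'
global_config = MulticutSegmentationWorkflow.get_config()['global']
global_config.update({'shebang': '#! /path/to/conda/envs/cluster_env/bin/python',
                      'block_shape': [50, 512, 512],
                      'groupname': 'YOUR_GROUP_NAME'})   # slurm account/group passed to sbatch
with open(os.path.join(config_dir, 'global.config'), 'w') as f:
    json.dump(global_config, f)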

MatthewBM commented 5 years ago

Ok this is what I'm getting now:


> Traceback (most recent call last):
>   File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
>     initial_sub_graphs(job_id, path)
>   File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
>     ignore_label)
>   File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
>     increaseRoi=True)
> RuntimeError: Request has wrong type
> 

That came from each of the 16 sbatch jobs. It looks like my data type might be off? I'm using the .n5 files, but here's what the .h5 file's data looks like when I pull a snippet with h5ls -d:

Boundary Predictions, where 1 is the background and 0 are the boundaries:

    (0,58,2742) 0.890196078431372, 0.866666666666667, 0.815686274509804, 0.717647058823529, 0.725490196078431, 0.592156862745098, 0.392156862745098, 0.192156862745098, 0.0941176470588235, 0.0431372549019608, 0.0235294117647059, 0.0196078431372549, 0.0156862745098039,
    (0,58,2755) 0.0235294117647059, 0.0392156862745098, 0.0901960784313725, 0.203921568627451, 0.407843137254902, 0.592156862745098, 0.756862745098039, 0.882352941176471, 0.945098039215686, 0.980392156862745, 0.992156862745098, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    (0,58,2779) 1, 0.996078431372549, 0.996078431372549, 0.996078431372549, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Watershed file, uint32 values in sequence with no holes:

    (0,3293,1530) 23660, 23660, 23660, 23660, 23660, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23715, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23698, 23368,
    (0,3293,1568) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368,
    (0,3293,1606) 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23368, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23652, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124,
    (0,3293,1644) 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23124, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643,
    (0,3293,1682) 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23643, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653,
    (0,3293,1720) 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653, 23653

constantinpape commented 5 years ago

The watershed needs to be stored in uint64. Sorry for the late reply.

constantinpape commented 5 years ago

Also, to avoid issues in the feature computation: boundary maps need to be stored either in uint8 or in float32.

MatthewBM commented 5 years ago

OK, I made sure my data is uint8 for the boundaries and uint64 for the watershed, but I'm still getting the same error:

sbatch -o ./tmp_mc_A/logs/initial_sub_graphs_0.log -e ./tmp_mc_A/error_logs/initial_sub_graphs_0.err -J initial_sub_graphs_0 ./tmp_mc_A/slurm_initial_sub_graphs.sh

cat ./tmp_mc_A/logs/initial_sub_graphs_0.log

Mytype: d your type: m
2019-04-24 21:23:27.502097: start processing job 0
2019-04-24 21:23:27.502127: reading config from ./tmp_mc_A/initial_sub_graphs_job_0.config
2019-04-24 21:23:27.515858: start processing block 0

cat ./tmp_mc_A/error_logs/initial_sub_graphs_0.err

Traceback (most recent call last):
  File "./tmp_mc_A/initial_sub_graphs.py", line 152, in <module>
    initial_sub_graphs(job_id, path)
  File "./tmp_mc_A/initial_sub_graphs.py", line 144, in initial_sub_graphs
    ignore_label)
  File "./tmp_mc_A/initial_sub_graphs.py", line 117, in _graph_block
    increaseRoi=True)
RuntimeError: Request has wrong type

MatthewBM commented 5 years ago

Looks like this error message has occurred in your z5 repo:

Merged #52, the issue should be fixed.

Originally posted by @constantinpape in https://github.com/constantinpape/z5/issues/50#issuecomment-388982862

constantinpape commented 5 years ago

Yes, this error message comes from z5 and indicates that some datatypes do not agree. Are you sure both boundaries and superpixels are stored correctly? Can you open them with z5 from Python?

import z5py
f = z5py.File('/path/to/data.n5')
ds = f['path/in/file']
print(ds.dtype)

If you do this the dtype should be uint8 (or float32) for the boundaries and uint64 for the superpixels.
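
If one of them comes back with the wrong dtype, the easiest fix is to rewrite the dataset; a minimal sketch with z5py (dataset names, chunk sizes and paths are placeholders, and for really large volumes you would do this block-wise instead of loading everything at once):

import z5py

f = z5py.File('/path/to/data.n5')
ds_in = f['watershed']              # e.g. stored as uint32
data = ds_in[:].astype('uint64')    # cast to the dtype the pipeline expects

ds_out = f.create_dataset('watershed_uint64', shape=data.shape, chunks=ds_in.chunks,
                          dtype='uint64', compression='gzip')
ds_out[:] = data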

MatthewBM commented 5 years ago

OK, I got the workflow up and running on my data and it also worked end-to-end on the sample data. I'm just getting memory errors in my merge_graphs workers. I'd like to switch the partition to the high-memory compute nodes, where we have 64 cores and 1.45 TB of RAM. I tried this: global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'}) but it seems the workers still get sent to the regular nodes. How can I see/edit exactly where the sbatch jobs are being submitted? I think I can also do '--mem=1G --ntasks==1' and it will split the jobs up along the resources on the node.

constantinpape commented 5 years ago

OK, I got the workflow up and running on my data and it also worked end-to-end on the sample data.

Glad to hear it!

I tried this: global_config.update({'shebang': shebang, 'block_shape': block_shape, 'groupname': 'ddp140', 'partition': 'large-shared', 'mem': '1450G'})

The global config does not support arbitrary arguments, but just the ones listed here: https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/cluster_tasks.py#L217-L224. Note that I added the partition option just now.

Also, you need to specify the memory limit for the individual tasks by updating the mem_limit value in the task config. See also https://github.com/constantinpape/cluster_tools/blob/master/example/cremi/run_mc.py#L92.
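
The per-task configs follow the same pattern as the global config: take the task's default config, set mem_limit (and, if needed, time_limit), and write it to the config directory under the task's name. A sketch with placeholder values (the exact keys are in the linked example):

import json
import os

from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'
configs = MulticutSegmentationWorkflow.get_config()

# e.g. give the single-job feature merging task more memory and a longer runtime
task_config = configs['merge_edge_features']
task_config.update({'mem_limit': 256, 'time_limit': 180})
with open(os.path.join(config_dir, 'merge_edge_features.config'), 'w') as f:
    json.dump(task_config, f)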

Hope this helps.

MatthewBM commented 5 years ago

OK, thanks, that works great.

The furthest I've been able to get is solve_global, but that ran for over 48 hours, which is my job time limit. I've been messing around with the parallel block size for my 7k x 5k x 400 xyz test volume (400 x 5k x 7k in the .n5 file format).

right now I'm trying block_shape = [80, 1024, 1024]

But I was wondering if you could recommend a value here? I'm allocating 180 GB of RAM to the parallel workers and can allocate up to 1.45 TB to the single-worker steps; I've done that for 'solve_global', 'solve_subproblems', and 'merge_edge_features' so far because they were giving 'out of memory' and 'segmentation fault' errors.

constantinpape commented 5 years ago

right now I'm trying block_shape = [80, 1024, 1024]

That sounds reasonable.

How many nodes are in the graph (i.e. how many super-voxel ids are there)?

Usually the solve_global step should be quite fast if the problem was reduced by solving the subproblems. Have you tried running everything on a smaller cutout of the data (say 200 x 1024 x 1024) and checked the results?
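
If it helps, such a cutout can be made directly from the n5 datasets with z5py; a sketch (dataset keys, chunk shape and cutout size are placeholders, and the same would need to be done for the watershed dataset):

import z5py

f_in = z5py.File('/path/to/boundaries.n5')
ds_in = f_in['data']

# copy a small block to a new file and run the full workflow on it
cutout = ds_in[:200, :1024, :1024]

f_out = z5py.File('/path/to/boundaries_cutout.n5')
ds_out = f_out.create_dataset('data', shape=cutout.shape, chunks=(25, 256, 256),
                              dtype=cutout.dtype, compression='gzip')
ds_out[:] = cutout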

One potential issue could be that your boundary maps follow a different convention than what I expect: I assume boundaries to correspond to high values (i.e. 1 means maximal boundary probability for a pixel). If your boundary maps have the opposite convention, you can set invert_inputs to True for probs_to_costs, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/costs/probs_to_costs.py#L51.

If you use the correct boundary convention and the cutout results look decent, there are two options to speed up the final multicut (see the config sketch after this list):

  1. Choose a different solver. This can be done by setting agglomerator to greedy-additive in the config of solve_global, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/multicut/solve_global.py#L43.
  2. Run with more hierarchy levels by setting n_scales > 1, see https://github.com/constantinpape/cluster_tools/blob/master/cluster_tools/workflows.py#L207.

These two options can also be combined. Note that both can reduce the quality of the resulting segmentation a bit, but from my experience the effect should not be very significant.
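
A sketch of what the two options look like in practice (config and key names follow the links above; the values are placeholders):

import json
import os

from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'
configs = MulticutSegmentationWorkflow.get_config()

# 1. use the greedy solver for the final, global problem
solve_global_config = configs['solve_global']
solve_global_config.update({'agglomerator': 'greedy-additive'})
with open(os.path.join(config_dir, 'solve_global.config'), 'w') as f:
    json.dump(solve_global_config, f)

# 2. reduce the problem over more hierarchy levels by passing n_scales=2 (or higher)
#    when constructing MulticutSegmentationWorkflow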

MatthewBM commented 5 years ago

I have 22 million superpixels

High boundary chance is 255, uint8. What's the difference between 'time_limit' and 'time_limit_solver'? This is a cross-section of my current volume (lots of myelin, unfortunately): [image attached]

After I get that to work and understand what the time / parallelization values should be, I'd like to try volumes that are more around the size of 15k x 15k x 1k, like this: [image attached]

Does that look like it could handle a greedy solver with higher n_scales?

constantinpape commented 5 years ago

I have 22 million superpixels

That should be fine; I have solved problems with about 2 orders of magnitude more superpixels with this pipeline.

high boundary chance is 255, uint8

That's good, you don't need to change invert_inputs then.

What's the difference between 'time_limit' and 'time_limit_solver'?

time_limit is the maximum time a job will run; it is passed as value for the -t parameter to slurm. time_limit_solver is a time limit that is passed to the actual multicut solver.

I forgot to mention this parameter earlier; setting time_limit_solver might actually fix your problem. You should set it to ~4 hours less than time_limit. (time_limit_solver is soft, which means the solver will not abruptly stop after the time has passed, but only checks it after completing an internal iteration. Depending on the problem size, an iteration can take quite a while, which is why it's safer to give some leeway compared to time_limit.)
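
For example, with a 48 hour job limit and roughly 4 hours of leeway, that would look like this (a sketch with placeholder values; as clarified below, time_limit is in minutes and time_limit_solver in seconds):

import json
import os

from cluster_tools import MulticutSegmentationWorkflow

config_dir = './config_mc'
solve_global_config = MulticutSegmentationWorkflow.get_config()['solve_global']
solve_global_config.update({
    'time_limit': 48 * 60,              # slurm job limit: 48 hours, in minutes
    'time_limit_solver': 44 * 60 * 60,  # soft solver limit: ~44 hours, in seconds
})
with open(os.path.join(config_dir, 'solve_global.config'), 'w') as f:
    json.dump(solve_global_config, f)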

Does that look like it could handle a greedy solver with higher n_scales?

Yes, this looks feasible.

MatthewBM commented 5 years ago

OK, looks like time_limit is in minutes (or slurm/sbatch command format) and time_limit_solver is in seconds, correct?

constantinpape commented 5 years ago

Yes, that's correct.

MatthewBM commented 5 years ago

OK, it's working through the full process and looks great. I'll email you a video of what it looks like. Thanks again!

constantinpape commented 5 years ago

You're very welcome, and thanks for your patience. I am looking forward to seeing the results :).