dask / dask-image

Distributed image processing
http://image.dask.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Support multiple chunks in label #29

Closed jakirkham closed 5 years ago

jakirkham commented 6 years ago

From @jakirkham on August 4, 2017 16:2

The current implementation of label in PR ( https://github.com/dask-image/dask-ndmeasure/pull/57 ) does not support multiple chunks, only 1 chunk along all dimensions. This is still useful when working with individual images that fit in memory. Still, to better utilize compute resources and to extend to even larger data, it would be nice to support multiple chunks.

Copied from original issue: dask-image/dask-ndmeasure#58

jakirkham commented 6 years ago

From @kmader on May 23, 2018 7:5

Without supporting multiple chunks, the function is just a delayed wrapper around scipy.ndimage.label. An easier starting point might be something that just produces approximate labels and returns the result of label run independently on each chunk.

jakirkham commented 6 years ago

From @kmader on May 23, 2018 7:24

The function as it is now works on multiple chunks, but it just consolidates all of the chunks into one array before running label. (https://github.com/dask-image/dask-ndmeasure/pull/83)

Here is the dot_graph for running label on a chunked image


jakirkham commented 6 years ago

Good point! I think this is a consequence of dask.delayed's behavior, which is to consolidate chunks. We might consider moving to map_blocks and doing the rechunking ourselves, which may be a bit nicer from an end user perspective.

Ultimately it is true that label as it stands now is not very exciting. For most of the data that we have been using this on thus far, the frames themselves fit in memory, but there may be too many of them to feed all of the frames into scipy.ndimage.label. In this way, label as it exists now solves a real world problem.
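For concreteness, what the current approach amounts to is roughly the following (an untested sketch, not the actual dask-image code; the function name is made up):

```python
import dask
import dask.array as da
import numpy as np
import scipy.ndimage


def naive_label(image):
    """Label a dask array by first consolidating it into one in-memory array."""
    # dask.delayed computes the full array before calling scipy.ndimage.label,
    # so this only works when the whole image fits in memory.
    labeled, num_features = dask.delayed(scipy.ndimage.label, nout=2)(image)
    labeled = da.from_delayed(labeled, shape=image.shape, dtype=np.int32)
    return labeled, num_features


# Usage: fine for a single frame/volume that fits in memory.
x = da.random.random((1024, 1024), chunks=(256, 256)) > 0.5
labels, n = naive_label(x)
```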

However, this is not what one thinks of as an out-of-core labeling function. A better solution likely involves da.map_overlap or similar. What's not clear to me is how we handle joining labels across block boundaries (particularly if an object crosses multiple block boundaries). Maybe you have thoughts on this?

jakirkham commented 6 years ago

From @kmader on May 30, 2018 22:20

It is a nightmarish problem to cover all of the possible edge cases and try and be somewhat efficient. I guess it would be good to try and figure out the most common use cases and make it work reasonably well for them.

Objects << chunks (cells in pathology slices):

da.map_overlap with an additional step to delete objects on the border, or whose midpoint is on the border (see the sketch below).

Objects >> chunks (vessels in a full-body scan):

downsample the image onto one node, label it, and then upscale the labels back onto the chunks.

It would probably also make sense to have a division between approximate and exact methods, since there are plenty of cases where approximate results are sufficient.
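For the first case, an untested sketch of the approximate approach (labels here are per-chunk and not yet globally unique; a stricter "keep only objects whose midpoint lies in the block core" rule would avoid double counting):

```python
import dask.array as da
import scipy.ndimage
from skimage.segmentation import clear_border


def label_block(block):
    labeled, _ = scipy.ndimage.label(block)
    # Drop objects touching the (expanded) block border; with a large enough
    # overlap, each object survives intact in at least one block.
    return clear_border(labeled)


def approximate_label(image, max_object_size):
    # Overlap by the largest expected object so border objects are seen whole
    # by a neighboring block before being dropped here.
    return image.map_overlap(label_block, depth=max_object_size, boundary=0)
```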

TomNicholas commented 5 years ago

I was just about to try to use ndmeasure.label to track some objects over time (a 4-D problem), where the data arrays are from turbulence simulations and won't fit in memory. However, I hadn't seen this issue, and the documentation isn't really clear that all ndmeasure.label does is wrap scipy.ndimage.label with dask.delayed.

I also don't think my problem satisfies either "objects << chunks" or "objects >> chunks". I'm looking for structures which are likely to be localised in some dimensions but extended in others.

It is a nightmarish problem to cover all of the possible edge cases and try and be somewhat efficient.

Is it? Couldn't you parallelize this by repeated application of scipy.ndimage.label?

If you ran label on every chunk separately, you would basically find every feature; they just wouldn't have corresponding labels between adjacent chunks. But you could then run label again on just the edges joining each chunk, which would tell you which label in one chunk corresponded to the same feature in another chunk. That should be enough information to correctly label all features across all chunks, right?

It would also hopefully have a roughly linear scaling (assuming your chunk size was appropriate), because the limiting factor should be the time taken to run the initial label on the chunks. Running the second label should be quick because the edge regions would be comparatively small (although there would be 2*N*n_chunks of them for an N-dimensional problem, so if your chunks were too small this step would become the bottleneck).
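Something like the following illustrates the two passes (an untested sketch for a 2-D image chunked along one axis; the function names are made up): label each chunk independently with globally unique labels, then read off equivalences from the touching faces.

```python
import numpy as np
import scipy.ndimage


def label_chunks(chunks):
    """chunks: list of NumPy blocks split along axis 0."""
    labeled, total = [], 0
    for block in chunks:
        lab, n = scipy.ndimage.label(block)
        lab = lab.astype(np.int64)
        lab[lab > 0] += total  # offset so labels are globally unique
        labeled.append(lab)
        total += n
    return labeled, total


def boundary_equivalences(labeled):
    """Pairs of labels that refer to the same feature across a chunk boundary."""
    pairs = set()
    for lo, hi in zip(labeled[:-1], labeled[1:]):
        bottom, top = lo[-1], hi[0]  # the two rows that touch across the boundary
        mask = (bottom > 0) & (top > 0)
        pairs.update(zip(bottom[mask].tolist(), top[mask].tolist()))
    return pairs
```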

TomNicholas commented 5 years ago

Could this potentially be done using dask.array.map_overlap? You would map scipy.ndimage.label over all the chunks, and then look in the overlapping regions for positions which have been assigned different labels in different blocks. Setting these labels to be the same across all blocks, trimming (and shifting the labels down so they are separated by 1), and returning would achieve what we want?

As mentioned before, joining features over the overlapping regions is the hard bit. If dask has a way of returning just the overlapping regions, then we can create some kind of global dictionary of equivalent labels by mapping over each overlap region. Then we do another map_blocks to apply a function which "de-uniques" each label as needed.
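The final "de-unique" step could look something like this (an untested sketch; mapping and n_labels are hypothetical inputs, not existing API):

```python
import dask.array as da
import numpy as np


def relabel(labels, mapping, n_labels):
    """Apply a provisional-label -> canonical-label mapping blockwise."""
    # Build a lookup table once, then apply it independently to every block.
    table = np.arange(n_labels + 1, dtype=labels.dtype)
    for provisional, canonical in mapping.items():
        table[provisional] = canonical
    return da.map_blocks(lambda block: table[block], labels, dtype=labels.dtype)
```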

jakirkham commented 5 years ago

...then look in the overlapping regions for positions which have been assigned different labels in different blocks. Setting these labels to be the same across all blocks...

This is the challenging part, I think. It's different enough that we might want to avoid map_overlap and just use the underlying primitives it is built on.

Though maybe you are already thinking this too... :)
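For reference, map_overlap is (roughly) sugar over the primitives in dask.array.overlap, which would give us a place to intervene between sharing the halos and trimming them. A small sketch, assuming the current dask API:

```python
import dask.array as da

x = da.random.random((1000, 1000), chunks=(250, 250)) > 0.7
depth = {0: 1, 1: 1}

# Share a 1-pixel halo between neighbouring blocks (constant-pad the edges)...
expanded = da.overlap.overlap(x, depth=depth, boundary=0)
# ...do the per-block labeling / halo inspection on `expanded` here...
# ...then cut the halos back off.
result = da.overlap.trim_internal(expanded, depth)
```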

If dask has a way of returning just the overlapping regions...

Something we need to think about is objects that pass through multiple chunks. We'd want to make sure these get labeled the same way everywhere, which may mean we need to order how we fix up the labels. Could be wrong about this though.

jni commented 5 years ago

I think the correct approach here is to map_overlap with an overlap of 1, then build an overlap graph of objects in different chunks to reconcile the labels, and finally come up with a single labeling. I think @stephenplaza solved this somewhere? With a harder problem, because he was dealing with stitching different segmentations. Here we know the labels will overlap exactly. @stephenplaza can you remind us of how you did this? Or tell us I am misremembering. =)
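Reconciling the labels from that overlap graph could be as simple as a connected-components pass over the (small) graph of label equivalences; an untested sketch, with `pairs` being the cross-chunk equivalences gathered earlier:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def resolve(pairs, n_labels):
    """Map each provisional label (0..n_labels) to a canonical, contiguous label."""
    pairs = np.array(sorted(pairs), dtype=np.intp).reshape(-1, 2)
    data = np.ones(len(pairs), dtype=bool)
    graph = coo_matrix(
        (data, (pairs[:, 0], pairs[:, 1])),
        shape=(n_labels + 1, n_labels + 1),
    )
    n_comp, comp = connected_components(graph, directed=False)
    # Renumber so background (provisional label 0) stays 0 and the remaining
    # components run contiguously from 1.
    remap = np.empty(n_comp, dtype=np.intp)
    remap[comp[0]] = 0
    others = np.setdiff1d(np.arange(n_comp), [comp[0]])
    remap[others] = np.arange(1, n_comp)
    return remap[comp]
```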

stephenplaza commented 5 years ago

I am not incredibly familiar with dask primitives (though we are probably slowly moving a lot of our spark applications to dask :). The solution I implemented was in spark and is discussed here: https://arxiv.org/abs/1604.00385 (and available on GitHub at https://github.com/janelia-flyem/DVIDSparkServices within the CreateSegmentation plugin, for the brave who can untangle it from everything else). It is really just as you would think: extract overlap regions and group those regions together. Once mappings are found, the presumably much smaller set of mappings is resolved at the driver (though more sophisticated strategies could be employed to avoid doing that). At least for my problem, the labels were noisy and overlap regions needed to be more than 1 pixel ... additional heuristics were needed to do the matching in a robust way given this noise.

jni commented 5 years ago

Thanks @stephenplaza! In our case we are just labeling connected components in a bigger-than-memory volume, so we are guaranteed exact overlaps. This makes the problem a lot neater. It is the spherical problem to your cow-shaped problem. =D We should be able to piece it together from the paper, as I recall I got a clear idea from it. :+1:

qedq commented 5 years ago

Ah, connected components would be a lot easier, and we also have an algorithm for that in the same software package (overlap of 1, as you would expect). I actually use it when evaluating segmentations within a much larger segmentation: https://www.frontiersin.org/articles/10.3389/fncir.2018.00102/full

stephenplaza commented 5 years ago

Sorry, used my other GitHub account in the previous post :). You can actually find the spark algorithm for connected components here https://github.com/janelia-flyem/DVIDSparkServices/blob/master/DVIDSparkServices/workflows/EvaluateSeg.py

ebo commented 5 years ago

I have recently been processing anywhere from 1.3 gigapixel to 2.5 gigapixel multiband images (and in one test case a 25 gigapixel 8-band image). The algorithms I hacked to work out-of-core allowed me to do image labeling and other fun stuff on those images running on AWS instances with all of 3GB of RAM... I simply cannot stuff those images into any but the biggest nodes with 100+GB of RAM without playing games with memory-mapped files and the like.

Last March I posted a query to the Pangeo folks looking for insight on how to leverage dask in a block-wise fashion (see https://github.com/pangeo-data/pangeo/issues/183 ). It turns out that rasterio gives you a mechanism to know the row/column of each image block, as well as the entire chunked description of the multidimensional array (I think they called it window(s), but I would have to review the docs). I have some code that can label an image chunk with an initial ID offset, which can be generated to guarantee a globally unique labeling across all the chunks. This would allow the dot_graph above to go from getter (FromImage) => label => finalize (where finalize unions the unique IDs and then collapses the IDs to be contiguous).

The problem I/we ran into is that I have not been able to find info on the block/chunk row/column, etc., necessary to generate the unique IDs. Remember that this info is provided by rasterio, but from a completely independent source.

Looking at the dot_graph image above, maybe that information is in getter or where it generates the FromImage message.
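If it helps, recent dask versions will pass a block_info dict to a map_blocks function that asks for it, which includes the block's location in the chunk grid. A rough sketch of using that to build a globally unique initial ID offset per block (the offset scheme here is just illustrative):

```python
import dask.array as da
import numpy as np
import scipy.ndimage


def label_with_offset(block, block_info=None):
    loc = block_info[0]['chunk-location']   # e.g. (row, col) of this block
    nchunks = block_info[0]['num-chunks']
    block_index = np.ravel_multi_index(loc, nchunks)
    labeled, _ = scipy.ndimage.label(block)
    labeled = labeled.astype(np.int64)
    # Reserve a disjoint label range per block (assumes < 2**32 labels/block).
    labeled[labeled > 0] += block_index * 2**32
    return labeled


x = da.random.random((1024, 1024), chunks=(256, 256)) > 0.5
labels = x.map_blocks(label_with_offset, dtype=np.int64)
```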

BTW, I am sorry that I missed this thread until now. I will have some time over the weekend to take at least a brief look at this -- assuming that the sprint is still ongoing (or that people are willing to provide feedback if it has ended).

jni commented 5 years ago

@ebo have a look at #94 and the graph therein. We now have distributed labeling implemented and it's just waiting for @jakirkham to move a few functions around and replace the scikit-image utility function with a vendored implementation.

ebo commented 5 years ago

I will take a look at this, and see about replacing or rewriting the watershed and/or other algorithms around your (how should we describe it -- parallel?) labeling/segmentation.