dask / dask-image

Distributed image processing
http://image.dask.org/en/latest/
BSD 3-Clause "New" or "Revised" License

Thresholding #100

Open jakirkham opened 5 years ago

jakirkham commented 5 years ago

Raised by @jni during the sprint, it would be nice to support some common thresholding methods. For the most part these should be pretty straightforward to implement. Would be good to gather some ideas about which methods are useful to include/support.

GenevieveBuckley commented 5 years ago

This discussion on the image.sc forum is extremely relevant: https://forum.image.sc/t/whats-your-favorite-thresholding-method-in-cellprofiler/14160

The big ones mentioned:

GenevieveBuckley commented 5 years ago

You might also take a look at the scikit-image thresholding methods, because python image processing users are likely to be using some of these already. I'd pick where to start by what would be the least amount of effort to get running, personally.

The core scikit-image thresholding methods are: 'threshold_otsu', 'threshold_yen', 'threshold_isodata', 'threshold_li', 'threshold_local', 'threshold_minimum', 'threshold_mean', 'threshold_niblack', 'threshold_sauvola', 'threshold_triangle'

GenevieveBuckley commented 5 years ago

And also - local thresholding methods are very important scientifically, but would be a pain to implement straight away. I think that's more realistic as a long term goal than a short term one.

jni commented 5 years ago

I wonder what those robust local ones are, as they would be desirable to implement in skimage.

I think local thresholding is easier to implement because you can just map_overlap it with the correct window size. Global methods will require more communication between blocks.

jakirkham commented 5 years ago

Thanks for writing this up @GenevieveBuckley. Have skimmed through the info.

Agree that local thresholding can be quite useful. As @jni notes local thresholds should be pretty straightforward as we can wrap an existing implementation from scikit-image with map_blocks or map_overlap. It's probably worth giving some thought to which ones we want to wrap and how to determine the overlap between blocks (if there is any) for these cases.
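To illustrate the map_overlap idea, here is a minimal sketch using a hand-rolled local-mean threshold in place of a scikit-image function (local_mean is a stand-in for illustration, not dask-image API; window size and array shape are arbitrary):

```python
import numpy as np
import dask.array as da

def local_mean(block, size=15):
    # stand-in for a scikit-image local threshold function:
    # same-shape local mean via a padded sliding window (pure NumPy)
    pad = size // 2
    padded = np.pad(block, pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return windows.mean(axis=(-2, -1))

image = da.random.random((256, 256), chunks=(64, 64))
size = 15
# halo of half the window so every pixel sees its full neighbourhood
local_thresh = image.map_overlap(
    local_mean, depth=size // 2, boundary="reflect", size=size, dtype=image.dtype
)
binary = (image > local_thresh).compute()
```

The depth argument answers the "how much overlap" question directly: for a windowed method it is half the window size.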

Global methods may require more communication, but may not be difficult to implement depending on what the method entails. An independent question is how performant the implementation would be and thus whether it is worth pursuing.

Some global cases may be so trivial as to be solved with a one line change in scikit-image. ( https://github.com/scikit-image/scikit-image/issues/3782 ) If that winds up not being possible, these cases could be trivially solved here as well.

Many of the other global algorithms involve binning the data and computing a histogram. There is already a way to compute a histogram for Dask Arrays, but it may require some tweaks for our use case ( https://github.com/dask/dask/issues/4544 ). As we will want to compute the min/max of the data before computing the histogram, this will need two passes over the data. Maybe there are clever ways to do this in one pass? If not, would we be comfortable performing two passes? It's also worth keeping issue ( https://github.com/dask/dask/issues/874 ) in mind.
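The two-pass pattern can be sketched with existing Dask primitives (array contents and sizes here are illustrative):

```python
import dask.array as da

x = da.random.random((2000, 2000), chunks=(500, 500))

# pass 1: global min/max, computed together in one traversal of the data
lo, hi = da.compute(x.min(), x.max())

# pass 2: histogram over the observed range
hist, edges = da.histogram(x, bins=256, range=[lo, hi])
counts = hist.compute()
```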

The worst global algorithms to implement are ones that require significant iteration and possibly terminate only as a direct consequence of the result's content. The reason is that DAGs can at best unroll loops, which only works well if we know a priori how many iterations there are and that number is not prohibitively large. These usually require persisting intermediate results somewhere (e.g. memory, a particular disk drive, etc.); thus, these are hard to implement generally. threshold_minimum and threshold_li appear to fit in this category. My inclination is to avoid these for now unless there are some clear use cases.

GenevieveBuckley commented 5 years ago

Local thresholding: Good point @jni , I hadn't thought that through. Happy to hear this might not be too painful to implement!

Global thresholding methods that involve histogram calculations only: This seems like low-ish hanging fruit to me.

Thresholding algorithms requiring iteration: I agree with @jakirkham , we should leave these alone unless there is a clear and pressing need for it.
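To make the histogram-only case concrete, here is a sketch that ports the standard histogram formulation of Otsu's method (as used by scikit-image) onto a Dask histogram. threshold_otsu_dask is a hypothetical name, not dask-image API, and the sketch assumes a non-constant image:

```python
import numpy as np
import dask.array as da

def threshold_otsu_dask(image, nbins=256):
    # pass 1: data range; pass 2: histogram over that range
    lo, hi = da.compute(image.min(), image.max())
    hist, edges = da.histogram(image, bins=nbins, range=[lo, hi])
    hist = hist.compute().astype(float)
    centers = (edges[:-1] + edges[1:]) / 2

    # class weights and means for every candidate threshold
    weight1 = np.cumsum(hist)
    weight2 = np.cumsum(hist[::-1])[::-1]
    mean1 = np.cumsum(hist * centers) / weight1
    mean2 = (np.cumsum((hist * centers)[::-1]) / weight2[::-1])[::-1]

    # pick the threshold maximising the between-class variance
    variance12 = weight1[:-1] * weight2[1:] * (mean1[:-1] - mean2[1:]) ** 2
    return centers[:-1][np.argmax(variance12)]
```

Only the histogram touches the full data; the threshold selection itself runs on an nbins-sized NumPy array.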

GenevieveBuckley commented 5 years ago

Links to investigate CellProfiler's robust thresholding method later:

"RobustBackground: This method assumes that the background distribution approximates a Gaussian by trimming the brightes and dimmest X% of pixel intensities where you choose a suitable percentage. It then calculates the mean and standard deviation of the remaining pixels and calculates the threshold as the mean + N times the standard deviation, where again you choose the number of standard deviations to suit your images."

jakirkham commented 5 years ago

Thanks for filling in the robust thresholding method.

Percentages are usually tricky to factor in as these imply sorting the data. Generally sorting on arbitrarily large data is problematic.

Finding the k-th largest (or smallest) values is possible with topk, but there is an important caveat.

This performs best when k is much smaller than the chunk size.

So rechunking beforehand may be needed to get reasonable performance.
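For example, the k-th largest value can be extracted like so (array contents and sizes are illustrative):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(1_000_000), chunks=100_000)
k = 10
# k largest values, sorted from largest to smallest;
# performs best when k is much smaller than the chunk size
largest = da.topk(x, k).compute()
kth_largest = largest[-1]
# the k smallest values can be obtained with a negative k: da.topk(x, -k)
```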

GenevieveBuckley commented 5 years ago

It seems most of scikit-image's thresholding implementations make use of the numpy.ndarray.flat flat iterator object, which is not compatible with dask arrays.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flat.html

jakirkham commented 5 years ago

Do you have some examples of ones that fail? Is there a particular reason that flat is used as opposed to ravel?

GenevieveBuckley commented 5 years ago

Hang on, this might be different for the scikit-image master branch, rather than skimage v0.14. Fingers crossed!

GenevieveBuckley commented 5 years ago

I'm having trouble pip installing the development version of skimage into a conda environment. It says I'm missing the Microsoft Visual C++ Build Tools, but I have definitely installed (and re-installed) those. Perhaps I need to add it to my path or something?

This is the first time I've tried to set up a scikit-image development environment on a Windows machine :(

jakirkham commented 5 years ago

Guessing you probably need to run vcvarsall.bat as that sets the environment variables for you. Maybe this doc is useful?

Alternatively you could use conda-build to build the conda-forge recipe for scikit-image. Would need to tweak it so it uses git for the source.

jni commented 5 years ago

@jakirkham

Is there a particular reason that flat is used as opposed to ravel?

as I understand it, this is only because some contributors think it looks nicer, and we had no reason to prefer one over the other in those instances. I would not be opposed to blanket changing all of them to ravel.
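For reference, dask arrays support ravel but (to my knowledge) do not provide a flat iterator at all, which is easy to confirm:

```python
import dask.array as da

x = da.ones((4, 4), chunks=2)
flattened = x.ravel()          # lazy; returns a 1-d dask array
has_flat = hasattr(x, "flat")  # dask arrays don't offer numpy's .flat
```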

jakirkham commented 5 years ago

Looking at thresholding.py (are there other thresholds outside of this?), it appears the only use of .flat is in threshold_li, which it sounds like we don't want to worry about implementing yet. So maybe this is not an issue any more (as Genevieve suspected).

GenevieveBuckley commented 5 years ago

Looking at thresholding.py (are there other thresholds outside of this?), it appears the only use of .flat is in threshold_li...

There are a few .flat in skimage/exposure/exposure.py too. A recent change to the master branch has removed .flat from skimage.exposure.histogram(), which is good. But there are still some other problems. To summarize:

numpy.histogram()

skimage.exposure.histogram

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-3b598f2d37d4> in <module>
----> 1 histogram(camera_int)

~\Documents\GitHub\dask-hacking\thresholding\exposure.py in histogram(image, nbins, source_range, normalize)
    129     # For integer types, histogramming with bincount is more efficient.
    130     if np.issubdtype(image.dtype, np.integer):
--> 131         hist, bin_centers = _bincount_histogram(image, source_range)
    132     else:
    133         if source_range == 'image':

~\Documents\GitHub\dask-hacking\thresholding\exposure.py in _bincount_histogram(image, source_range)
     67         image_min, image_max = dtype_limits(image, clip_negative=False)
     68     image, offset = _offset_array(image, image_min, image_max)
---> 69     hist = np.bincount(image.ravel(), minlength=image_max - image_min + 1)
     70     bin_centers = np.arange(image_min, image_max + 1)
     71     if source_range == 'image':

TypeError: 'Array' object cannot be interpreted as an integer

dask.array.histogram()

When I try it on arrays with integer values, the returned hist and bin_centers have mismatched lengths (bin_centers is one element longer than hist), which makes it a bit fiddlier than a drop-in replacement.

Maybe this is straight up user error - if so please point it out!

hist, bin_centers = da.histogram(image.ravel(), bins=256, range=[0, 255])

We'd also probably want to fix up dask issue 4544.

How to reproduce

Your environment should have:

In ipython

import numpy as np
import dask.array as da
import skimage.data
import skimage.exposure

camera_numpyint = skimage.data.camera()
camera_numpyfloat = camera_numpyint / 255
camera_int = da.from_array(camera_numpyint, chunks=[2,2])
camera_float = da.from_array(camera_numpyfloat, chunks=[2,2])

type(camera_numpyint[0][0])
type(camera_numpyfloat[0][0])
type(camera_float[0][0].compute())
type(camera_int[0][0].compute())

np.histogram(camera_numpyint)
np.histogram(camera_numpyfloat)
np.histogram(camera_float)
np.histogram(camera_int)

skimage.exposure.histogram(camera_numpyint)
skimage.exposure.histogram(camera_numpyfloat)
skimage.exposure.histogram(camera_float)
skimage.exposure.histogram(camera_int)  # doesn't work, see traceback above

jni commented 5 years ago

Just a quick comment that dask array histogram is probably modelled after numpy histogram, which uses bin edges, rather than skimage histogram, which uses bin centres.
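Converting numpy-style bin edges to skimage-style bin centres is a one-liner, which may be all the adapter that's needed:

```python
import dask.array as da

x = da.random.random(10_000, chunks=1_000)
hist, edges = da.histogram(x, bins=256, range=[0, 1])
# numpy-style output: len(edges) == len(hist) + 1
# skimage-style bin centres are the midpoints of adjacent edges
centers = (edges[:-1] + edges[1:]) / 2
```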

GenevieveBuckley commented 5 years ago

Ok, I think I've got my head around it now. See PR https://github.com/scikit-image/scikit-image/pull/3823

To give you a rough idea of performance: With the very small image (512, 512):

In [10]: %timeit thresh = thresholding.threshold_otsu(camera_float)
254 ms ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With a bigger random array (10000, 10000):

In [34]: arr
Out[34]: dask.array<random_sample, shape=(10000, 10000), dtype=float64, chunksize=(1000, 1000)>

In [35]: %timeit thresholding.threshold_otsu(arr)
13.1 s ± 1.59 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit thresholding.threshold_minimum(arr)
10.3 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

GenevieveBuckley commented 5 years ago

threshold_li is the last remaining global threshold method in scikit-image that throws errors with dask arrays. Here's why:

  1. Easy to fix: image.flat[0] should become image.ravel()[0]
  2. Harder to fix: The check to remove any NaN values from the array causes dask to have a small existential crisis, since now it doesn't know what size the array chunks should be. @jakirkham is there a more dask-like way of removing NaNs I should know about? (I do know da.isfinite(), but that's not quite the same thing that's happening here - link)

jni commented 5 years ago

Yeah looking at the function I think it's going to be super tricky to fix that. Worse: we were thinking of adding those checks to all compatible thresholding functions. :joy:

jakirkham commented 5 years ago

There was some discussion about computing a Dask Array and using that to fill in the chunk sizes. ( https://github.com/dask/dask/issues/3293 ) This isn't without tradeoffs of course, but in cases where this is really needed it does help.

Could you please provide a bit more context on why we need to worry about unknown array shape in this case?
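A sketch of that approach, assuming a dask version that provides Array.compute_chunk_sizes:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([0.0, np.nan, 2.0, 3.0]), chunks=2)
masked = x[~da.isnan(x)]               # chunk sizes are now unknown (nan)
masked = masked.compute_chunk_sizes()  # one extra pass over the data fills them in
first = masked[0].compute()            # indexing works again
```

The trade-off is exactly the one discussed in that issue: filling in the chunk sizes costs an extra computation over the data.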

GenevieveBuckley commented 5 years ago

To clarify: the unknown array shape is a problem because it means you cannot index the array anymore, and that is something that happens almost immediately with image.ravel()[0].

Minimal example:

import dask.array as da
import numpy as np
import skimage.data

image = da.from_array(skimage.data.camera(), chunks=(256, 256))
image = image[~np.isnan(image)]
np.all(image == image.ravel()[0])

For context, this last line happens because we want to check if the image only has one intensity value, in which case we'll just return a threshold equal to that intensity.

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-43-7d4740260598> in <module>
----> 1 np.all(image == image.ravel()[0])

~\AppData\Local\conda\conda\envs\dask\lib\site-packages\dask\array\core.py in __getitem__(self, index)
   1153
   1154         out = 'getitem-' + tokenize(self, index2)
-> 1155         dsk, chunks = slice_array(out, self.name, self.chunks, index2)
   1156
   1157         graph = HighLevelGraph.from_collections(out, dsk, dependencies=[self])

~\AppData\Local\conda\conda\envs\dask\lib\site-packages\dask\array\slicing.py in slice_array(out_name, in_name, blockdims, index)
    156
    157     # Pass down to next function
--> 158     dsk_out, bd_out = slice_with_newaxes(out_name, in_name, blockdims, index)
    159
    160     bd_out = tuple(map(tuple, bd_out))

~\AppData\Local\conda\conda\envs\dask\lib\site-packages\dask\array\slicing.py in slice_with_newaxes(out_name, in_name, blockdims, index)
    178
    179     # Pass down and do work
--> 180     dsk, blockdims2 = slice_wrap_lists(out_name, in_name, blockdims, index2)
    181
    182     if where_none:

~\AppData\Local\conda\conda\envs\dask\lib\site-packages\dask\array\slicing.py in slice_wrap_lists(out_name, in_name, blockdims, index)
    233     # No lists, hooray! just use slice_slices_and_integers
    234     if not where_list:
--> 235         return slice_slices_and_integers(out_name, in_name, blockdims, index)
    236
    237     # Replace all lists with full slices  [3, 1, 0] -> slice(None, None, None)

~\AppData\Local\conda\conda\envs\dask\lib\site-packages\dask\array\slicing.py in slice_slices_and_integers(out_name, in_name, blockdims, index)
    278     for dim, ind in zip(shape, index):
    279         if np.isnan(dim) and ind != slice(None, None, None):
--> 280             raise ValueError("Arrays chunk sizes are unknown: %s", shape)
    281
    282     assert all(isinstance(ind, (slice, Integral)) for ind in index)

ValueError: ('Arrays chunk sizes are unknown: %s', (nan,))

jakirkham commented 5 years ago

Sure, I'll rephrase then, why do we want to index the array?

GenevieveBuckley commented 5 years ago

Sure, I'll rephrase then, why do we want to index the array?

We're checking to see if the image is entirely one single shade of grey (i.e. all intensity values in the image are identical). If that's the case we don't go any further and return from the function.
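One way to perform that check without indexing is to compare the global min and max, which sidesteps the unknown-chunk problem entirely (a sketch, not necessarily what ends up in scikit-image):

```python
import numpy as np
import dask.array as da

def is_constant(image):
    # True iff the image contains a single intensity value;
    # avoids ravel()[0], so unknown chunk sizes are fine
    lo, hi = da.compute(image.min(), image.max())
    return bool(lo == hi)

flat_img = da.from_array(np.full((4, 4), 7.0), chunks=2)
varied_img = da.from_array(np.arange(16.0).reshape(4, 4), chunks=2)
```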

A possible alternative:

It doesn't look like you could integrate this with scikit-image easily (sorry @jni), but correct me if I'm wrong.

Things we can't do:

jakirkham commented 4 years ago

FWIW I've filed an issue about handling Future objects within Dask ( https://github.com/dask/dask/issues/6380 ). Assuming this is agreeable to others, this may help us unpick use cases where we need to do some intermediate computation as part of determining the final result.