coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

Image analysis #751

Open jrbourbeau opened 1 year ago

jrbourbeau commented 1 year ago

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans and very high-resolution microscopy (3D movies of cells at micron resolution). These folks often want to load data, apply image processing filters across it using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.

A representative example would be to:

  1. Load data with from_zarr
  2. Perform map_overlap with a trivial function
  3. Then select a sequence of regions one after the other
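A minimal sketch of those three steps (random data stands in for a real zarr store, and the shapes, depth, and region pattern are arbitrary):

```python
import dask.array as da

# Stand-in for `da.from_zarr("path/to/volume.zarr")`, so the sketch runs anywhere
images = da.random.random((30, 200, 200), chunks=(10, 100, 100))

# map_overlap with a trivial function (placeholder for a real image filter)
def trivial_filter(block):
    return block

filtered = images.map_overlap(trivial_filter, depth=1, boundary="reflect")

# Select a sequence of regions one after the other, as an interactive viewer would
regions = [filtered[i * 10:(i + 1) * 10, :100, :100].compute() for i in range(3)]
```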
jrbourbeau commented 1 year ago

@GenevieveBuckley this is part of our broader effort to include large-scale, representative user workflows in examples and decisions that inform future development of Dask (xref https://github.com/coiled/coiled-runtime/issues/725). I'd love to get your thoughts on what you think makes a good representative image analysis workflow. Thoughts on a nice public dataset? Maybe you already have a notebook that shows this type of workflow off?

GenevieveBuckley commented 1 year ago

I'd say typical image analysis workflows often have this structure:

  1. Loading data
  2. Preprocessing images (an interactive preview here would help with choosing the right parameters)
  3. Identifying objects (also a good spot for interactive use; you almost always need to manually correct segmentations in some way)
  4. Measuring objects

Loading data

Preprocessing images could include things like:

Identifying objects usually involves

Measuring objects

mrocklin commented 1 year ago

Thanks @GenevieveBuckley . I don't suppose you or someone around you has a representative workflow lying around? My guess is that if we have a non-imaging person try to construct a normal workflow it'll be both time consuming and miss important aspects of the work. I'm hopeful that you know of some dataset and canonical notebook somewhere that we could clean up and use instead of trying to invent something from scratch.

GenevieveBuckley commented 1 year ago

Maybe you already have a notebook that shows this type of workflow off?

I gave a talk at PyConAU a few years ago, and it does include a basic image segmentation workflow. This isn't the most exciting example, because it's a stack of separate 2d images and embarrassingly parallel, but it was good for showing all the dask-image features we had.

Nick Sofreniew also has a nice demo from a few years ago showing interactive cell segmentation. The context here was showing off what you could do with napari, and how interactivity makes things a lot easier. Here's the jupyter notebook (there are two other jupyter notebooks that come before that one, so if things don't make sense you can skip back a bit if you need) and corresponding YouTube video.

GenevieveBuckley commented 1 year ago

Thoughts on a nice public dataset?

Lots of thoughts! You might also ask @joshmoore, I think there are some good datasets already in .zarr format on the Image Data Resource (IDR) (plus you should be able to preview them remotely with napari).

MRI dataset

Timeseries light microscopy

A timeseries microscopy dataset, like these developing insect embryos, would be stunning. It would fit easily into a workflow that loads the data, filters the images, thresholds bright objects to segment the nuclei, then counts & measures them.

http://celltrackingchallenge.net/3d-datasets/

Developing Tribolium Castaneum embryo (red flour beetle)

Or if the one above is too big, you could try this smaller embryo

C.elegans developing embryo (worm embryo)

Lattice lightsheet microscopy

I also think a lattice lightsheet dataset could make a good example. Talley has a nice example (and has turned it into a plugin) deskewing a lattice lightsheet dataset, and you can download the example data.

The napari-lattice wiki pages have some more detail about the desired workflow (load, deskew, crop, deconvolve, etc.), which might be helpful for background context.

Electron imaging dataset

Janelia makes volume electron microscopy datasets publicly available on AWS: https://openorganelle.janelia.org/

Possibly also useful are the Caltech Electron Tomography Database and/or EMPIAR (the Electron Microscopy Public Image Archive). They might be a bit harder to search through for something suitable though.

Histology whole slide image(s)

I know the CAMELYON challenge has publicly available histology images. But it can be tricky to download them (I think they're hosted on Google Drive, so it works ok manually if you do it once or twice, but repeated downloads will trigger rate limiting or blocks), and the data also needs to be converted to zarr (this WSI reader might be helpful?). https://github.com/dask/dask-image/issues/125

More datasets

We've been collecting lists of datasets in these issues:

joshmoore commented 1 year ago

~There is a list of OME-Zarr related datasets under https://ngff.readthedocs.io/en/rtd/data/index.html (though that URL will be updated to https://ngff.openmicroscopy.org/data once I'm back from travels).~

There is a list of OME-Zarr related datasets under https://ngff.openmicroscopy.org/data

If other links end up here, it'd be great to also point to them from that resource.

One that needs adding, for example, is http://zebrahub.org/

mrocklin commented 1 year ago

What I'm seeing here is a bunch of "here is a bunch of raw material that you could use" which is great. Thank you all.

However, I suspect that @jrbourbeau is now in a position where he's being asked to judge a bunch of stuff that he doesn't understand at all. If anyone has the time to say "This workload (or two) is computationally representative of the image processing community. It is what we would want included in a regularly run Dask benchmark suite." that would be welcome.

Otherwise, my guess is that James will choose one or two of these at random and it won't be very good. (no offense James)

GenevieveBuckley commented 1 year ago

That's fair. We don't want lots of ideas; we just need one or two good ones.

Let's do these two things:

  1. Image analysis example 1: You can use the example I wrote for this talk using the BBBC039 dataset, since it can be used pretty much verbatim.
  2. Image analysis example 2: A nuclei segmentation example using the red flour beetle ("Developing Tribolium Castaneum embryo") from the Cell Tracking Challenge. If you wait a little bit, I'll upload a notebook or script to get you started.
GenevieveBuckley commented 1 year ago

It might also be helpful if James or someone could explain what the purpose of these benchmarks is.

mrocklin commented 1 year ago

That's fair. We don't want lots of ideas; we just need one or two good ones.

Good summary!

Dask-only is less realistic

Realistic is good.

You'll likely have to host the image data used for these benchmarks somewhere yourself, and maybe convert them to zarr first, is that requirement ok?

It's less fun, but also totally ok.

From my perspective there are two overlapping objectives:

  1. Give OSS Dask engineers information about common use case performance as they develop on Dask, and make sure that important workloads continue to improve and are the focus of attention.
  2. Give new users examples that they can see, get excited by, and then copy-paste-edit

For something like a napari-style interactive session, objective two probably doesn't make sense in this format (benchmarks, notebooks, etc.), and we're mostly looking at objective one: making sure that Dask engineers are sensitive to the computational needs of this group.

GenevieveBuckley commented 1 year ago

Image analysis example 1

Embarrassingly parallel computation on 2D fluorescence microscopy images of cell nuclei.

Dataset: BBBC039 dataset
Code: from this talk

```python
import numpy as np
from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure

images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, block_size=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
])
binary_images = ndmorph.binary_closing(threshold_images, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_images)
index = np.arange(num_features)
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)
# ... compute things! (index, area, mean_intensity) and save the output as a csv, probably
# ideally without unnecessarily re-computing parts of the dask task graph multiple times
# (not that you would do this... but I sometimes do it without thinking, oops)
# ... maybe make some graphs of that output, e.g. a histogram or violin plot of
# area and mean_intensity, or a 2d plot of area vs mean_intensity...
```

What you compute depends on why you want to make these benchmarks.

As I said above, if you are mimicking interactive user behaviour, you'll want to grab a few random image regions of the raw data, and the same after most of the different steps. Users will often change variables a few times (like the sigma value for the gaussian smoothing, or different threshold values for the mask) and compare the outputs to pick the one they like best.

After this interactive period, setting it up and choosing all the parameter values, people generally would compute the whole dataset. Saving the output measurements to csv and generating some summary statistics/graphs would be pretty standard.
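On the re-computation point above: passing all the outputs to a single `dask.compute` call shares the task graph, so upstream work runs only once (toy arrays here stand in for the real measurements):

```python
import dask
import dask.array as da

# Toy stand-ins for measurement results that share an upstream graph
images = da.random.random((4, 100, 100), chunks=(1, 100, 100))
centered = images - images.mean()  # shared upstream work

area = (centered > 0).sum()
mean_intensity = centered.mean()

# One compute call evaluates the shared graph once, instead of
# calling .compute() separately on each result
area_val, mean_val = dask.compute(area, mean_intensity)
```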

In the talk I plotted nuclei area vs mean intensity, but only for the first three images so it would run quickly. Mean intensity is not a very meaningful thing to measure, but it is easy to do and I couldn't think of anything better to replace it with. I think that's fine for a benchmark or demo.

mrocklin commented 1 year ago

OK, so I'm hearing that we can have a large graph like this, and then we can do two things:

  1. Full compute/persist
  2. [x[..., ...].compute() for ... in range(10)] or something like that

That would let us cover two different important cases with the same code
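Sketching those two cases against a stand-in graph (the shapes and region pattern here are made up):

```python
import dask.array as da

x = da.random.random((64, 512, 512), chunks=(8, 256, 256))
graph = x.map_overlap(lambda block: block, depth=1, boundary="reflect")

# 1. Full compute/persist of the whole graph
full = graph.persist()

# 2. A sequence of small region reads, mimicking interactive exploration
regions = [graph[i, :128, :128].compute() for i in range(10)]
```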

Hopefully this also generates some nice images that draw users in?

mrocklin commented 1 year ago

If what I'm hearing is correct then this sounds easy and doable (although I'll let @jrbourbeau determine that when he gets up).

GenevieveBuckley commented 1 year ago

Image analysis example 2: A nuclei segmentation example using the red flour beetle ("Developing Tribolium Castaneum embryo") from the Cell Tracking Challenge

Here's some half-finished work. Gist: https://gist.github.com/GenevieveBuckley/41f49c56640c155f68c346b82c04e803

I'm having a problem at the last step, going from a simple threshold mask of all the nuclei, to individual labels for each separate nucleus. There's nothing stopping you from using this and leaving the last part out, and maybe I'll figure it out later. I've got to go home now though, so I'm sharing what I've got so far.

In the gist, we have:

GenevieveBuckley commented 1 year ago

Here's a possible approach to fix the final labelling step for the red flour beetle. Perhaps this will be the fix for the dask problem:

https://forum.image.sc/t/doing-watershed-with-large-image/77283/2

Use dask.array.map_overlap with scikit-image watershed using watershed_lines=True. This makes sure that each segment is separated from its neighbours by black/background pixels. Where the overlaps end, your segments will be cut by a vertical line, but not by the background. Then, use dask_image.ndmeasure.label to label the connected components. This should unify the segments at the window boundary, but not the ones that are separated by the watershed line.

I don't like having to use watershed_lines=True on the red flour beetle data (the nuclei are packed pretty tightly and they're not that big). But if it's the easiest way to get a result, we might just go with it.

GenevieveBuckley commented 1 year ago

This does fall into the "here is a bunch of raw material that you could use" category, so disregard it if you like.

But I found a Kaggle "3D Image Analysis using Dask" notebook, which is a nice in-the-wild example.

It looks like it was made as an exercise for this course. I don't know the author, but I think this is the right K Mader.

I suspect there'd be some work involved if you wanted to turn the notebook into a benchmark. There are several questions/exercises at the end of the notebook, and there may be another version with example solutions to those parts floating around as well. Could be something to keep in mind for the future.

GenevieveBuckley commented 1 year ago

(This note is more for me, rather than James or anyone else at Coiled. I'm not expecting anyone else to look through Robert's code, but I want to link it in case there is anything that might also be useful here)

Potentially useful resource for the red-flour-beetle 3D example:

Robert Haase has written some clesperanto demo scripts segmenting cells in 3D from a tribolium embryo (red flour beetle). It's not exactly the same dataset as the cell tracking challenge one I've linked above, but looks quite similar. The one he uses has been downsampled to 1x1x1mm voxels, which is roughly similar to the lowest resolution level of the zarr file I was playing with here.