jrbourbeau opened this issue 1 year ago
@GenevieveBuckley this is part of our broader effort to include large-scale, representative user workflows in examples and decisions that inform future development of Dask (xref https://github.com/coiled/coiled-runtime/issues/725). I'd love to get your thoughts on what you think makes a good representative image analysis workflow. Thoughts on a nice public dataset? Maybe you already have a notebook that shows this type of workflow off?
I'd say typical image analysis workflows often have this structure:

- Loading data
- Preprocessing images (could include things like ...)
- Identifying objects (usually involves ...)
- Measuring objects
Thanks @GenevieveBuckley . I don't suppose you or someone around you has a representative workflow lying around? My guess is that if we have a non-imaging person try to construct a normal workflow it'll be both time consuming and miss important aspects of the work. I'm hopeful that you know of some dataset and canonical notebook somewhere that we could clean up and use instead of trying to invent something from scratch.
Maybe you already have a notebook that shows this type of workflow off?
I gave a talk at PyConAU a few years ago, and it does include a basic image segmentation workflow. This isn't the most exciting example, because it's a stack of separate 2d images and embarrassingly parallel, but it was good for showing all the dask-image features we had.
Nick Sofreniew also has a nice demo from a few years ago showing interactive cell segmentation. The context here was showing off what you could do with napari, and how interactivity makes things a lot easier. Here's the jupyter notebook (there are two other jupyter notebooks that come before that one, so if things don't make sense you can skip back a bit if you need) and corresponding YouTube video.
Thoughts on a nice public dataset?
Lots of thoughts! You might also ask @joshmoore, I think there are some good datasets already in .zarr format on the Image Data Resource (IDR) (plus you should be able to preview them remotely with napari).
A timeseries microscopy dataset, like these developing insect embryos, would be stunning. It would fit very easily into a workflow to load the data, filter the images, threshold bright objects to segment the nuclei, then count & measure them.
- http://celltrackingchallenge.net/3d-datasets/
  - Developing Tribolium Castaneum embryo (red flour beetle)
  - Or, if the one above is too big, you could try this smaller embryo: C. elegans developing embryo (worm embryo)
I also think a lattice lightsheet dataset could make a good example. Talley has a nice example (and has turned it into a plugin) deskewing a lattice lightsheet dataset, and you can download the example data.
The napari-lattice wiki pages have some more detail about the desired workflow (load, deskew, crop, deconvolve, etc.), which might be helpful for background context.
Janelia makes volume electron microscopy datasets publicly available on AWS: https://openorganelle.janelia.org/
Possibly also useful are the Caltech Electron Tomography Database and/or EMPIAR (the Electron Microscopy Public Image Archive). They might be a bit harder to search through for something suitable though.
I know the CAMELYON challenge has publicly available histology images. But it can be tricky to download them (I think they're hosted on Google Drive, so it works OK manually if you do it once or twice, but repeated downloads will trigger rate limiting or blocks), and they also need to be converted to zarr (this WSI reader might be helpful?). https://github.com/dask/dask-image/issues/125
We've been collecting lists of datasets in these issues:
~~There is a list of OME-Zarr related datasets under https://ngff.readthedocs.io/en/rtd/data/index.html (though that URL will be updated to https://ngff.openmicroscopy.org/data once I'm back from travels).~~
There is a list of OME-Zarr related datasets under https://ngff.openmicroscopy.org/data
If other links end up here, it'd be great to also point to them from that resource.
One that needs adding, for example, is http://zebrahub.org/
What I'm seeing here is a bunch of "here is a bunch of raw material that you could use" which is great. Thank you all.
However, I suspect that @jrbourbeau is now in a position where he's being asked to judge a bunch of stuff that he doesn't understand at all. If anyone has the time to say "This workload (or two) is computationally representative of the image processing community. It is what we would want included in a regularly run Dask benchmark suite." that would be welcome.
Otherwise, my guess is that James will choose one or two of these at random and it won't be very good. (no offense James)
That's fair. You don't want lots of ideas; you need just one or two good ones.
Let's do these two things:
It might also be helpful if James or someone could explain what the purpose of these benchmarks is.
`.compute()` calls on different sections of the data.
Good summary!
Dask-only is less realistic
Realistic is good.
You'll likely have to host the image data used for these benchmarks somewhere yourself, and maybe convert them to zarr first, is that requirement ok?
It's less fun, but also totally ok.
From my perspective there are two overlapping objectives:
For something like a Napari-like interactive session, objective two probably doesn't make sense in this format (benchmarks, notebooks, etc.), and we're mostly looking for objective one: making sure that the Dask engineers are sensitive to the computational needs of this group.
Image analysis example 1: Embarrassingly parallel computation on 2D fluorescence microscopy images of cell nuclei.
Dataset: the BBBC039 dataset
Code: from this talk
```python
import numpy as np
from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure

images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, block_size=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
])
binary_images = ndmorph.binary_closing(threshold_images, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_images)
index = np.arange(num_features)
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)
# ... compute things! (index, area, mean_intensity) and save the output as a csv, probably
# Ideally, don't unnecessarily re-compute parts of the dask task graph multiple times
# (not that you would do this... but I sometimes do it without thinking about it, oops)
# ... maybe make some graphs of that output, e.g. a histogram or violin plot of
# area and mean_intensity; maybe a 2D plot of area vs mean_intensity...
```
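The "don't re-compute parts of the task graph" point can be sketched concretely. This is a hedged, self-contained example using placeholder arrays rather than the real per-label measurements; the pattern, not the values, is the point:

```python
import dask
import dask.array as da

# Placeholder arrays standing in for the real measurements; both outputs
# share upstream graph nodes, just like area and mean_intensity above.
images = da.random.random((4, 64, 64), chunks=(1, 64, 64))
area = images.sum(axis=(1, 2))
mean_intensity = images.mean(axis=(1, 2))

# Anti-pattern: two separate .compute() calls each re-execute the shared
# upstream portion of the graph.
# area_result = area.compute()
# mean_result = mean_intensity.compute()

# Better: a single dask.compute() call evaluates both outputs in one pass,
# sharing intermediate results between them.
area_result, mean_result = dask.compute(area, mean_intensity)
```

The same applies to saving a csv and plotting: build all the outputs lazily, then hand them to one `dask.compute()` call at the end.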
What you compute depends on why you want to make these benchmarks.
As I said above, if you are mimicking interactive user behaviour, you'll want to get a random few image regions of the raw data, and the same after most of the different steps. Users will often change variables a few times (like the sigma value for the gaussian smoothing, or different threshold values for the mask) and compare the output to pick the one they like best.
After this interactive period, setting it up and choosing all the parameter values, people generally would compute the whole dataset. Saving the output measurements to csv and generating some summary statistics/graphs would be pretty standard.
In the talk I plotted nuclei area vs mean intensity, but only for the first three images so it would run quickly. Mean intensity is not a very meaningful thing to measure, but it is easy to do and I couldn't think of anything better to replace it with. I think that's fine for a benchmark or demo.
OK, so I'm hearing that we can have a large graph like this, and then we can do two things:
`[x[..., ...].compute() for ... in range(10)]`

or something like that. That would let us cover two different important cases with the same code.
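A runnable version of those two access patterns might look like the following. The shapes, chunking, and region sizes are hypothetical stand-ins, not anything from the thread:

```python
import numpy as np
import dask.array as da

# Stand-in for a lazily loaded image stack (hypothetical shape and chunks).
stack = da.random.random((10, 512, 512), chunks=(1, 512, 512))

# Case 1, interactive-style access: pull a few small regions of the raw
# data, one .compute() call per region (mimics a user poking around).
rng = np.random.default_rng(seed=0)
regions = [stack[i, :128, :128].compute() for i in rng.integers(0, 10, size=3)]

# Case 2, batch-style access: one compute over the entire graph
# (mimics the final full-dataset run).
overall_mean = stack.mean().compute()
```

Case 1 stresses low-latency partial evaluation of a large graph; case 2 stresses throughput on the whole thing, which is why the same code covers two different benchmark cases.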
Hopefully this also generates some nice images that draw users in?
If what I'm hearing is correct then this sounds easy and doable (although I'll let @jrbourbeau determine that when he gets up).
Image analysis example 2: A nuclei segmentation example using the red flour beetle ("Developing Tribolium Castaneum embryo") from the Cell Tracking Challenge
Here's some half-finished work. Gist: https://gist.github.com/GenevieveBuckley/41f49c56640c155f68c346b82c04e803
I'm having a problem at the last step, going from a simple threshold mask of all the nuclei, to individual labels for each separate nucleus. There's nothing stopping you from using this and leaving the last part out, and maybe I'll figure it out later. I've got to go home now though, so I'm sharing what I've got so far.
In the gist, we have:
Here's a possible approach to fix the final labelling step for the red flour beetle.
Here we go, perhaps this will be the fix for the dask problem
https://forum.image.sc/t/doing-watershed-with-large-image/77283/2
> Use `dask.array.map_overlap` with the scikit-image watershed using `watershed_line=True`. This makes sure that each segment is separated from its neighbors by black/background pixels. Where the overlaps end, your segments will be cut by a vertical line, but not by the background. Then, use `dask_image.ndmeasure.label` to label the connected components. This should unify the segments at the window boundary, but not the ones that are separated by the watershed line.
I don't like having to use `watershed_line=True` on the red flour beetle data (the nuclei are packed pretty tightly and they're not that big). But if it's the easiest way to get a result, we might just go with it.
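A minimal sketch of the suggested approach, with a random array standing in for the beetle data and a deliberately simplistic per-block segmentation (the threshold, marker strategy, and overlap depth here are illustrative assumptions, not the real pipeline):

```python
import numpy as np
import dask.array as da
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def segment_block(block):
    """Watershed one block. watershed_line=True keeps background pixels
    between touching segments, so they stay separable after stitching."""
    mask = block > 0.5  # illustrative threshold
    markers, _ = ndi.label(mask)
    distance = ndi.distance_transform_edt(mask)
    labels = watershed(-distance, markers, mask=mask, watershed_line=True)
    # Return a binary mask: watershed lines are 0, segment interiors are 1.
    return (labels > 0).astype(np.uint8)

image = da.random.random((256, 256), chunks=(128, 128))  # stand-in data
segmented = image.map_overlap(
    segment_block, depth=16, boundary="reflect", dtype=np.uint8
)
result = segmented.compute()
# A final connected-components pass (e.g. dask_image.ndmeasure.label) would
# then assign one label per nucleus, unifying segments across block edges.
```

The `depth` needs to be at least the radius of the largest nucleus, otherwise objects straddling a chunk boundary get cut before the final labelling pass can rejoin them.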
This does fall into the "here is a bunch of raw material that you could use" category, so disregard it if you like.
But I found a Kaggle "3D Image Analysis using Dask" notebook, which is a nice in-the-wild example.
It looks like it was made as an exercise for this course. I don't know the author, but I think this is the right K Mader.
I suspect there'd be some work involved in turning the notebook into a benchmark. There are several questions/exercises at the end of the notebook, and there may be another version with example solutions to those parts floating around as well. Could be something to keep in mind for the future.
(This note is more for me, rather than James or anyone else at Coiled. I'm not expecting anyone else to look through Robert's code, but I want to link it in case there is anything that might also be useful here)
Potentially useful resource for the red-flour-beetle 3D example:
Robert Haase has written some clesperanto demo scripts segmenting cells in 3D from a tribolium embryo (red flour beetle). It's not exactly the same dataset as the cell tracking challenge one I've linked above, but looks quite similar. The one he uses has been downsampled to 1x1x1mm voxels, which is roughly similar to the lowest resolution level of the zarr file I was playing with here.
There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans, and very high resolution microscopy (3d movies at micro resolution of cells). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.
A representative example would be to:

- load the data with `from_zarr`
- apply `map_overlap` with a trivial function
- visually explore the result
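Those steps can be sketched as follows. The zarr URL is a hypothetical placeholder (a random array stands in for it here), and a gaussian filter plays the role of the "trivial function":

```python
import dask.array as da
from scipy import ndimage as ndi

# Step 1: load the data. A real benchmark would read a hosted store, e.g.:
# data = da.from_zarr("s3://example-bucket/volume.zarr")  # hypothetical path
data = da.random.random((64, 256, 256), chunks=(16, 128, 128))  # stand-in

# Step 2: apply a filter across chunk boundaries with map_overlap.
smoothed = data.map_overlap(
    lambda block: ndi.gaussian_filter(block, sigma=1),
    depth=4,
    boundary="reflect",
    dtype=data.dtype,
)

# Step 3: human-in-the-loop exploration pulls individual slices on demand,
# e.g. whatever plane a napari viewer is currently showing.
middle_slice = smoothed[32].compute()
```

The latency of that last per-slice compute is exactly what the interactive users described above care about, which makes it a natural thing for the benchmark to time.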