Closed dchaley closed 1 year ago
Comments reference an academic paper as basis for implementation:
Robinson, Kevin and Whelan, Paul F. (2004) Efficient morphological reconstruction: a downhill filter. Pattern Recognition Letters, 25 (15). pp. 1759-1767. ISSN 0167-8655
Available from Dublin City University or the Vision Systems Group lab page. Full text PDF: whelan_2004_104.pdf
Paper references source code but the URL is broken. I think I found it on the internet archive: prl_DownhillFilter.zip
Morphological reconstruction is a "well studied" area. I found another interesting paper:
2012: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
It references a 2010 paper by Pavel Karas: Efficient Computation of Morphological Greyscale Reconstruction. Dagstuhl Research Online Publication Server, Full text pdf
Question: how does the scikit implementation compare vs. these algorithms? Can we do a side-by-side of software algorithms, without introducing GPUs? (These papers support the ability to implement GPU algorithms if we need it.)
The Downhill Filter pseudocode from whelan_2004_104.pdf, page 7, translated.
The mask image is the input image. The markers are the DeepCell model outputs.
Let m be the maximum value from the marker image (model output). In English: the number m, among the marker pixel values, such that for every pixel value p, m >= p.
For all image points: if the point's marker value i is non-zero, then append that point to the height list L at the height (aka index) i.
For each height list in L, for height n from highest to lowest:
- While the height list at n is not empty:
  - Take p, the first pixel at height n. Remove p from the height list at n.
  - For each of p's neighbors q:
    - If the mask (input) image pixel value at q is non-zero, and the new marker value at q has not been assigned:
      - Set the new marker value at q to the smaller of height n and the neighbor's mask value.
      - Append q to the height list at that resulting (smaller) height.
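The translated steps above can be sketched in plain Python. This is my own reading of the pseudocode (names are mine, not the paper's); it assumes an integer marker that is elementwise <= the mask, and 4-connectivity:

```python
import numpy as np
from collections import deque


def downhill_reconstruct(marker: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sketch of the downhill filter as translated above.

    Assumes integer images with marker <= mask elementwise, 4-connectivity.
    """
    assert np.issubdtype(marker.dtype, np.integer)
    h, w = marker.shape
    out = np.zeros_like(marker)
    m = int(marker.max())  # m: max of the marker pixel values
    # One "height list" per intensity 1..m; nonzero marker points start
    # in the list for their own value.
    heights = [deque() for _ in range(m + 1)]
    for y, x in zip(*np.nonzero(marker)):
        out[y, x] = marker[y, x]
        heights[marker[y, x]].append((y, x))
    for n in range(m, 0, -1):  # highest height down to lowest
        while heights[n]:
            y, x = heights[n].popleft()  # p: first pixel at height n
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                qy, qx = y + dy, x + dx  # q: a neighbor of p
                if not (0 <= qy < h and 0 <= qx < w):
                    continue
                # Only fill pixels inside the mask that aren't assigned yet
                if mask[qy, qx] != 0 and out[qy, qx] == 0:
                    v = min(n, int(mask[qy, qx]))
                    out[qy, qx] = v
                    heights[v].append((qy, qx))
    return out
```

Each pixel is assigned exactly once, so no pixel is ever enqueued twice: that's where the "no sorting needed" property comes from.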
✨ interesting finding ✨ the Downhill Filter paper doesn't mention sorting. Sorting appears to be expensive in the scikit implementation.
Note that rank_order performs another argsort: `sort_order = flat_image.argsort().astype(unsigned_dtype, copy=False)` (see the rank_order source code; for some reason the screenshot isn't shown here).
We're sorting twice. Do we actually have to? Can we reuse the sort already done in rank_order?
In principle we don't need to sort: we just need to group the pixels by intensity. Would it be faster to do it more literally, as the algorithm says, and create a list per intensity value? What input are we using: byte (0-255) or floating point (0.0..1.0) grayscale? Byte is probably much faster because we can leverage direct arrays.
Can we approximate down to 255 ... heck even 1024... (or something)
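To make the "group, don't sort" idea concrete, here's a counting-pass sketch (a hypothetical helper, not scikit's API) that builds the per-intensity ordering for a byte image in O(n), with no comparison sort:

```python
import numpy as np


def group_by_intensity(image: np.ndarray):
    """Group pixel indices by intensity with a counting pass (O(n)),
    instead of an O(n log n) argsort. Works because byte images have
    only 256 possible values."""
    flat = image.ravel()
    counts = np.bincount(flat, minlength=256)
    # offsets[v] = where intensity v's bucket starts in `order`
    offsets = np.concatenate(([0], np.cumsum(counts[:-1])))
    order = np.empty(flat.size, dtype=np.intp)
    cursor = offsets.copy()
    for i, v in enumerate(flat):  # single pass; cheap once in C/Cython
        order[cursor[v]] = i
        cursor[v] += 1
    return order, offsets, counts
```

The Python loop would be slow as written, but the point is the structure: one counting pass plus one scatter pass replaces the argsort, and the buckets are exactly the downhill filter's height lists.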
Hypothesis: the Downhill Filter implementation could be optimized (e.g. the discussion above about not needing to actually sort). There's also potential for parallelization:
Alternative algorithm: "Fast Hybrid gray scale reconstruction" as presented in the 2012 paper: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
Note that this is a spelled out version of "Approach D" in the 2004 downhill filter paper:
Here's the algorithm:
Given the proof of GPU parallelization, and the more straightforward algorithm structure (scan, scan, then queue), I think parallelizing Fast-Hybrid on CPU is a good option.
Here's the 2012 fast-hybrid-GPU speed comparison:
Note that the FH CPU implementation is on a single-core CPU. Effectively, the proposal is to implement FH CPU but in parallel (multi-core or distributed).
Note also that the scan structure gives us linear data access, versus the height map, which is linear over the data during initialization but then "bounces all over" the image as it iterates over the height lists. In scan-scan-queue we only have that problem during the queue phase. So the question might also be: how many pixels get requeued? Note that the 2012 paper parallelizes the queue part on GPU.
Maybe we should just skip all this and go straight to a parallelized GPU implementation lol. But GPUs haven't been readily available, and personally I wouldn't know how to write GPU code for this. I also think we haven't nearly hit the ceiling of CPU-only performance; we don't need real-time, we just need not-5-minutes.
Additional finding to support not pursuing Downhill Filter. Original paper by Vincent (1993) (private link for full text pdf), which reviews parallel + sequential approaches, says the 'classic' (aka math-derived) approach is highly parallelizable but not on "conventional computers":
But "conventional computers" look very different in 2023 from 1993! Scaling horizontally to many processors/computers is far easier than 20 years ago.
I have some naive ideas how to implement the above "hybrid" scan in parallel. But first I'm researching how hardware/GPU solutions work. It should be inspirational for a CPU/cloud version.
Googling led me to this paper:
Fast Morphological Image Processing on GPU using CUDA, by Mugdah A. Rane at the College of Engineering, Pune. She in turn references the "Van Herk / Gil & Kimmel" aka vHGK algorithm. Currently reading ^^ this paper, plus looking at vHGK source code. Read this some more? https://github.com/cuekoo/Herk-Gil-Werman-Fast-Morphology
I found a paper from the "GK" of the original HGK paper that improves on the algorithm. 2003-Jan Gil & Kimmel: Efficient Dilation, Erosion, Opening and Closing Algorithms full text
This is a fairly detailed explanation / walkthrough of the algorithm improvements. However it's 1-dimensional and would need to be generalized to 2.
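For intuition, here's my sketch of the 1-D van Herk / Gil-Werman running max that these papers build on (my own reading, not the paper's code). The trick: split the signal into blocks of the window size, take one forward and one backward cumulative max per block, and every window's max is then one comparison:

```python
import numpy as np


def van_herk_dilate_1d(x: np.ndarray, p: int) -> np.ndarray:
    """1-D grayscale dilation (running max) over an odd window of size p.
    Per-pixel cost is independent of the window size."""
    assert p % 2 == 1, "window size must be odd"
    r = p // 2
    n = len(x)
    # Pad with -inf: r on the left, and enough on the right so the
    # padded length is a multiple of the block size p.
    pad_right = r + (-(n + 2 * r)) % p
    xp = np.pad(x.astype(float), (r, pad_right), constant_values=-np.inf)
    blocks = xp.reshape(-1, p)
    # g: forward cumulative max within each block
    g = np.maximum.accumulate(blocks, axis=1).ravel()
    # h: backward cumulative max within each block
    h = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()
    # A size-p window spans at most two adjacent blocks, so its max is
    # max(h[left edge], g[right edge]).
    j = np.arange(n) + r  # original index i sits at padded index i + r
    return np.maximum(h[j - r], g[j + r])
```

Generalizing to 2-D means running this along rows then columns (for a square window), which is where the window decomposition discussion in the paper comes in.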
Reading: page 8
Reconstruction algorithms refer to the "window" or "structuring element": basically the window within which we compute a min/max operation.
DeepCell uses radius = 2 for Mesmer. source code I think this means the window is 5x5 (2 each way from a center point), as determined by using the disk function (source code) defined in this source code. Note that although the window is 5x5, we use a circle within it (b/c radius).
If I understand correctly, the more granular (smaller) this window, the more "tasks" we have to run. Soooo radius=2 means we're doing a lot of work. 😤
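To sanity-check the footprint shape, here's a minimal numpy reconstruction of the radius-2 disk (skimage's disk uses the same x² + y² <= r² rule, as I read its source):

```python
import numpy as np

# Radius-2 circular footprint: a 5x5 window with the corners excluded
r = 2
yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
footprint = (xx ** 2 + yy ** 2 <= r ** 2).astype(np.uint8)
print(footprint)
# 13 of the 25 window pixels fall inside the radius-2 circle
```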
OK–
The Hybrid Algorithm for dilation is a max operation applied in raster order, then in anti-raster order, then the queue-based max filtering. As seen in A fast parallel implementation of queue-based morphological reconstruction using gpus.
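Here's how I'd sketch that scan-scan-queue structure in plain sequential Python (4-connectivity, my own naming; assumes marker <= mask elementwise). It's a sketch of the technique, not the paper's code:

```python
import numpy as np
from collections import deque


def fast_hybrid_reconstruct(marker: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sequential fast-hybrid grayscale reconstruction:
    raster max scan, anti-raster max scan, then queue-based propagation."""
    im = marker.copy()
    h, w = im.shape
    n_plus = ((-1, 0), (0, -1))   # neighbors already visited in raster order
    n_minus = ((1, 0), (0, 1))    # neighbors already visited in anti-raster order
    # Phase 1: raster scan, propagating maxima forward, clipped by the mask
    for y in range(h):
        for x in range(w):
            v = im[y, x]
            for dy, dx in n_plus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v = max(v, im[ny, nx])
            im[y, x] = min(v, mask[y, x])
    # Phase 2: anti-raster scan; queue pixels that could still propagate
    queue = deque()
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            v = im[y, x]
            for dy, dx in n_minus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v = max(v, im[ny, nx])
            im[y, x] = min(v, mask[y, x])
            for dy, dx in n_minus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    if im[ny, nx] < im[y, x] and im[ny, nx] < mask[ny, nx]:
                        queue.append((y, x))
                        break
    # Phase 3: FIFO queue, propagating the remaining changes
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                if im[ny, nx] < im[y, x] and im[ny, nx] < mask[ny, nx]:
                    im[ny, nx] = min(im[y, x], mask[ny, nx])
                    queue.append((ny, nx))
    return im
```

The two scans are the part with linear data access (and the part the HGK-style decompositions would accelerate); the queue phase only touches the pixels the scans couldn't finish.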
The HGK algorithm, plus the improvement described in Efficient Dilation, Erosion, Opening and Closing Algorithms (2003), performs the parallel max phases.
Our algorithm would need to parallelize these units of work. For a radius of 2, aka p=5, the "obvious" (aka embarrassing) parallelism is (img-width/5) * (img-height/5) windows (I think). Given how memory-expensive the whole thing is, that's a lot better. But can we get it down to 512px? 20k/5 = 4k, which is 8x as big as the 512x512 that seems to be "Very Fast".
I rewrote the grayscale reconstruction part of h_maxima using OpenCV, which I understand to have better performance here than Scikit based on this github issue.
question from pratapvardhan:
wondering - if scipy.ndimage is already used in skimage, why not use faster versions of scipy.ndimage.{grey_dilation, .grey_operations}?
answer from jni:
to not break the API. Our versions are slower and mess with the data types! Grrr. I've been campaigning to do just this but haven't had time to implement, but when I do (should be soon actually!), we have to be careful not to break legacy code.
Rest of thread shows improvements in other algorithms but not the ones we're using (h_maxima/grayreconstruct).
```python
import cv2
import numpy as np
from skimage.morphology import disk


def opencv_reconstruct(marker: np.ndarray, mask: np.ndarray, radius: int = 2):
    kernel = disk(radius)
    # .item() converts the numpy scalar to a python scalar
    pad_value = np.min(marker).item()
    # Copy so the np.copyto below doesn't clobber the caller's marker
    image = marker.copy()
    expanded = image.copy()
    while True:
        expanded = cv2.dilate(
            src=image,
            dst=expanded,
            kernel=kernel,
            borderType=cv2.BORDER_CONSTANT,
            borderValue=pad_value,
        )
        expanded = np.fmin(expanded, mask)
        # Termination criterion: expansion didn't change the image at all
        if (image == expanded).all():
            return expanded
        np.copyto(dst=image, src=expanded)


def opencv_h_maxima(image, h=0.075, radius=2):
    # Assumes a float image (np.finfo); mirrors skimage's h_maxima setup
    resolution = 2 * np.finfo(image.dtype).resolution * np.abs(image)
    shifted_img = image - h - resolution
    reconstructed = opencv_reconstruct(shifted_img, image, radius)
    residue_img = image - reconstructed
    return (residue_img >= h).astype(np.uint8)
```
Test results vs scikit `h_maxima`:
Interpretation
As stated in the papers, the "straightforward" iteration approach is very slow for larger inputs. But it's ~10x as fast for 512x512.
(I'll put up a notebook shortly)
Open avenues:
Good citizen work?
Benchmark implementation / notebook: https://github.com/dchaley/deepcell-imaging/pull/17
My plan for the OpenCV implementation was to use dilation with a before/after footprint (excluding pixels as appropriate during the forward or backward scan). But I have no visibility into the iteration order; in fact I think the whole point is parallel execution, whereas FastHybrid needs a very specific iteration order.
But all hope is not lost 😅 The scikit grayreconstruct python implementation uses C under the hood to perform certain operations, after setting things up in numpy.
I believe it should be "easy" to create a similar .pyx iteration function, to perform fast-hybrid's raster scans, and defer back to python for the FIFO queue. Starting point: https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html
One concern is that if we start including C somewhere then it has to be super easy to build/deploy/install. Solving in scikit itself would be nice 🙏🏻 cause it'd be actually invisible to DeepCell (modulo a scikit version upgrade).
I implemented "fast hybrid" in normal Python. As expected it's a lot slower.
```
Image size: (512, 512)
scikit h_maxima: 0.15 s
opencv h_maxima: 0.01 s
OpenCV Speedup: 9.75x
scikit == opencv : True
custom python h_maxima: 9.62 s
Python speedup: 0.02x
scikit == custom python : True
```
However, the python code should (I hope) be very readily portable to Cython, at least the array-scan parts. This isn't the worst, as it's relatively common to distribute buildable Cython code.
https://github.com/dchaley/deepcell-imaging/pull/27 reimplements the fast-hybrid scan in Cython. As hoped, it's a LOT faster. 🎉
PR TLDR:
With this proof of concept speeding up h_maxima by an order of magnitude, I'm declaring this spike complete 😤 cc @lynnlangit
As noted from the PR there are at least 2 potential next steps:
The most expedient for our purposes is updating DeepCell, but it would be nice to contribute to scikit…
Followup issue here: https://github.com/dchaley/deepcell-imaging/issues/28
Mesmer postprocessing has 3 main steps: h_maxima, watershed, and fill_holes. (See: #9)
h_maxima happens in the deepcell-toolbox, which in turn calls scikit-image h_maxima, which lastly is largely implemented using gray reconstruction.
This ticket is to investigate optimization opportunities.