Closed dchaley closed 1 year ago
Comments reference an academic paper as basis for implementation:
Robinson, Kevin and Whelan, Paul F. (2004) Efficient morphological reconstruction: a downhill filter. Pattern Recognition Letters, 25 (15). pp. 1759-1767. ISSN 0167-8655
Available from Dublin City University or the Vision Systems Group lab page. Full text PDF: whelan_2004_104.pdf
Paper references source code but the URL is broken. I think I found it on the internet archive: prl_DownhillFilter.zip
Morphological reconstruction is a "well studied" area. I found another interesting paper:
2012: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
It references a 2010 paper by Pavel Karas: Efficient Computation of Morphological Greyscale Reconstruction. Dagstuhl Research Online Publication Server, Full text pdf
Question: how does the scikit implementation compare vs. these algorithms? Can we do a side-by-side of software algorithms, without introducing GPUs? (These papers support the ability to implement GPU algorithms if we need it.)
The Downhill Filter pseudocode from whelan_2004_104.pdf, page 7, translated.
The mask image is the input image. The markers are the DeepCell model outputs.
Let m be the maximum value from the marker image (model output). In English: the number m, among the marker pixel values, such that for every pixel value p, m >= p.
For all image points: if the point's marker value i is non-zero, then append that point to the height list L at the height (aka index) i.
For each height list in L, for height n from highest to lowest:
- While the height list at n is not empty:
  - Take p, the first pixel at height n. Remove p from the height list at n.
  - For each of p's neighbors q:
    - If the mask (input) image pixel value at q is non-zero, and the new marker value at q has not been assigned:
      - Set the new marker value at q to the smaller of height n and the neighbor's mask value.
      - Append q to the height list at that resulting (smaller) height.
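The translated steps above can be sketched in plain Python. This is my own reading of the pseudocode (names are mine, not the paper's); it assumes an integer marker that is elementwise <= the mask, and 4-connectivity:

```python
import numpy as np
from collections import deque


def downhill_reconstruct(marker: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sketch of the downhill filter as translated above.

    Assumes integer images with marker <= mask elementwise, 4-connectivity.
    """
    assert np.issubdtype(marker.dtype, np.integer)
    h, w = marker.shape
    out = np.zeros_like(marker)
    m = int(marker.max())  # m: max of the marker pixel values
    # One "height list" per intensity 1..m; nonzero marker points start
    # in the list for their own value.
    heights = [deque() for _ in range(m + 1)]
    for y, x in zip(*np.nonzero(marker)):
        out[y, x] = marker[y, x]
        heights[marker[y, x]].append((y, x))
    for n in range(m, 0, -1):  # highest height down to lowest
        while heights[n]:
            y, x = heights[n].popleft()  # p: first pixel at height n
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                qy, qx = y + dy, x + dx  # q: a neighbor of p
                if not (0 <= qy < h and 0 <= qx < w):
                    continue
                # Only fill pixels inside the mask that aren't assigned yet
                if mask[qy, qx] != 0 and out[qy, qx] == 0:
                    v = min(n, int(mask[qy, qx]))
                    out[qy, qx] = v
                    heights[v].append((qy, qx))
    return out
```

Each pixel is assigned exactly once, so no pixel is ever enqueued twice: that's where the "no sorting needed" property comes from.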
✨ interesting finding ✨ the Downhill Filter paper doesn't mention sorting. Sorting appears to be expensive in the scikit implementation.
Note that rank_order performs another argsort: `sort_order = flat_image.argsort().astype(unsigned_dtype, copy=False)` (see the rank_order source code; for some reason the screenshot isn't shown here).
We're sorting twice. Do we actually have to? Can we reuse the sort already done in rank_order?
In principle we don't need to sort: we just need to group the pixels by intensity. Would it be faster to do it more literally, as the algorithm says, and create a list per intensity value? What input are we using: byte (0-255) or floating point (0.0..1.0) grayscale? Byte is probably much faster because we can leverage direct arrays.
Can we approximate down to 255 ... heck even 1024... (or something)
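To make the "group, don't sort" idea concrete, here's a counting-pass sketch (a hypothetical helper, not scikit's API) that builds the per-intensity ordering for a byte image in O(n), with no comparison sort:

```python
import numpy as np


def group_by_intensity(image: np.ndarray):
    """Group pixel indices by intensity with a counting pass (O(n)),
    instead of an O(n log n) argsort. Works because byte images have
    only 256 possible values."""
    flat = image.ravel()
    counts = np.bincount(flat, minlength=256)
    # offsets[v] = where intensity v's bucket starts in `order`
    offsets = np.concatenate(([0], np.cumsum(counts[:-1])))
    order = np.empty(flat.size, dtype=np.intp)
    cursor = offsets.copy()
    for i, v in enumerate(flat):  # single pass; cheap once in C/Cython
        order[cursor[v]] = i
        cursor[v] += 1
    return order, offsets, counts
```

The Python loop would be slow as written, but the point is the structure: one counting pass plus one scatter pass replaces the argsort, and the buckets are exactly the downhill filter's height lists.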
Hypothesis: the Downhill Filter implementation could be optimized (e.g. the discussion above about not needing to actually sort). There's also potential for parallelization:
Alternative algorithm: "Fast Hybrid gray scale reconstruction" as presented in the 2012 paper: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
Note that this is a spelled out version of "Approach D" in the 2004 downhill filter paper:
Here's the algorithm:
Given the proof of GPU parallelization, and the more straightforward algorithm structure (scan, scan, then queue), I think parallelizing Fast-Hybrid on CPU is a good option.
Here's the 2012 fast-hybrid-GPU speed comparison:
Note that the FH CPU implementation is on a single-core CPU. Effectively, the proposal is to implement FH CPU but in parallel (multi-core or distributed).
Note also that the scan structure gives us linear data access, versus the height map, which is linear over the data during initialization but then "bounces all over" the image as it iterates over the height lists. In scan-scan-queue we only have that problem during the queue phase. So the question might also be: how many pixels get requeued? Note that the 2012 paper parallelizes the queue part on GPU.
Maybe we should just skip all this and go straight to a parallelized GPU implementation lol. But GPUs haven't been readily available, and personally I wouldn't know how to write GPU code for this. I also think we haven't nearly hit the ceiling of CPU-only performance; we don't need real-time, we just need not-5-minutes.
Additional finding to support not pursuing Downhill Filter. Original paper by Vincent (1993) (private link for full text pdf), which reviews parallel + sequential approaches, says the 'classic' (aka math-derived) approach is highly parallelizable but not on "conventional computers":
But "conventional computers" look very different in 2023 from 1993! Scaling horizontally to many processors/computers is far easier than 20 years ago.
I have some naive ideas how to implement the above "hybrid" scan in parallel. But first I'm researching how hardware/GPU solutions work. It should be inspirational for a CPU/cloud version.
Googling led me to this paper:
Fast Morphological Image Processing on GPU using CUDA, by Mugdah A. Rane at the College of Engineering, Pune. She in turn references the "Van Herk / Gil & Kimmel" aka vHGK algorithm. Currently reading ^^ this paper, plus looking at vHGK source code. Read this some more? https://github.com/cuekoo/Herk-Gil-Werman-Fast-Morphology
I found a paper from the "GK" of the original HGK paper that improves on the algorithm. 2003-Jan Gil & Kimmel: Efficient Dilation, Erosion, Opening and Closing Algorithms full text
This is a fairly detailed explanation / walkthrough of the algorithm improvements. However it's 1-dimensional and would need to be generalized to 2.
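For intuition, here's my sketch of the 1-D van Herk / Gil-Werman running max that these papers build on (my own reading, not the paper's code). The trick: split the signal into blocks of the window size, take one forward and one backward cumulative max per block, and every window's max is then one comparison:

```python
import numpy as np


def van_herk_dilate_1d(x: np.ndarray, p: int) -> np.ndarray:
    """1-D grayscale dilation (running max) over an odd window of size p.
    Per-pixel cost is independent of the window size."""
    assert p % 2 == 1, "window size must be odd"
    r = p // 2
    n = len(x)
    # Pad with -inf: r on the left, and enough on the right so the
    # padded length is a multiple of the block size p.
    pad_right = r + (-(n + 2 * r)) % p
    xp = np.pad(x.astype(float), (r, pad_right), constant_values=-np.inf)
    blocks = xp.reshape(-1, p)
    # g: forward cumulative max within each block
    g = np.maximum.accumulate(blocks, axis=1).ravel()
    # h: backward cumulative max within each block
    h = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()
    # A size-p window spans at most two adjacent blocks, so its max is
    # max(h[left edge], g[right edge]).
    j = np.arange(n) + r  # original index i sits at padded index i + r
    return np.maximum(h[j - r], g[j + r])
```

Generalizing to 2-D means running this along rows then columns (for a square window), which is where the window decomposition discussion in the paper comes in.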
Reading: page 8
Reconstruction algorithms refer to the "window" or "structuring element": basically the window within which we compute a min/max operation.
DeepCell uses radius = 2 for Mesmer. source code I think this means the window is 5x5 (2 each way from a center point), as determined by using the disk function (source code) defined in this source code. Note that although the window is 5x5, we use a circle within it (b/c radius).
If I understand correctly, the more granular (smaller) this window, the more "tasks" we have to run. Soooo radius=2 means we're doing a lot of work. 😤
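To sanity-check the footprint shape, here's a minimal numpy reconstruction of the radius-2 disk (skimage's disk uses the same x² + y² <= r² rule, as I read its source):

```python
import numpy as np

# Radius-2 circular footprint: a 5x5 window with the corners excluded
r = 2
yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
footprint = (xx ** 2 + yy ** 2 <= r ** 2).astype(np.uint8)
print(footprint)
# 13 of the 25 window pixels fall inside the radius-2 circle
```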
OK–
The Hybrid Algorithm for dilation is a max operation applied in raster order, then in anti-raster order, then the queue-based max filtering. As seen in A fast parallel implementation of queue-based morphological reconstruction using gpus.
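Here's how I'd sketch that scan-scan-queue structure in plain sequential Python (4-connectivity, my own naming; assumes marker <= mask elementwise). It's a sketch of the technique, not the paper's code:

```python
import numpy as np
from collections import deque


def fast_hybrid_reconstruct(marker: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sequential fast-hybrid grayscale reconstruction:
    raster max scan, anti-raster max scan, then queue-based propagation."""
    im = marker.copy()
    h, w = im.shape
    n_plus = ((-1, 0), (0, -1))   # neighbors already visited in raster order
    n_minus = ((1, 0), (0, 1))    # neighbors already visited in anti-raster order
    # Phase 1: raster scan, propagating maxima forward, clipped by the mask
    for y in range(h):
        for x in range(w):
            v = im[y, x]
            for dy, dx in n_plus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v = max(v, im[ny, nx])
            im[y, x] = min(v, mask[y, x])
    # Phase 2: anti-raster scan; queue pixels that could still propagate
    queue = deque()
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            v = im[y, x]
            for dy, dx in n_minus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    v = max(v, im[ny, nx])
            im[y, x] = min(v, mask[y, x])
            for dy, dx in n_minus:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    if im[ny, nx] < im[y, x] and im[ny, nx] < mask[ny, nx]:
                        queue.append((y, x))
                        break
    # Phase 3: FIFO queue, propagating the remaining changes
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                if im[ny, nx] < im[y, x] and im[ny, nx] < mask[ny, nx]:
                    im[ny, nx] = min(im[y, x], mask[ny, nx])
                    queue.append((ny, nx))
    return im
```

The two scans are the part with linear data access (and the part the HGK-style decompositions would accelerate); the queue phase only touches the pixels the scans couldn't finish.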
The HGK algorithm, plus the improvement described in Efficient Dilation, Erosion, Opening and Closing Algorithms (2003), performs the parallel max phases.
Our algorithm would need to parallelize these units of work. For a radius of 2, aka p=5, the "obvious" (aka embarrassing) parallelism is (img-width/5) * (img-height/5) windows (I think). Given how memory-expensive the whole thing is, that's a lot better. But can we get it down to 512px? 20k/5 = 4k, which is 8x as big as the 512x512 that seems to be "Very Fast".
I rewrote the grayscale reconstruction part of h_maxima using OpenCV, which I understand to have better performance here than Scikit based on this github issue.
question from pratapvardhan:
wondering - if scipy.ndimage is already used in skimage, why not use faster versions of scipy.ndimage.{grey_dilation, .grey_operations}?
answer from jni:
to not break the API. Our versions are slower and mess with the data types! Grrr. I've been campaigning to do just this but haven't had time to implement, but when I do (should be soon actually!), we have to be careful not to break legacy code.
Rest of thread shows improvements in other algorithms but not the ones we're using (h_maxima/grayreconstruct).
```python
import cv2
import numpy as np
from skimage.morphology import disk


def opencv_reconstruct(marker: np.ndarray, mask: np.ndarray, radius: int = 2):
    kernel = disk(radius)
    # .item() converts the numpy scalar to a python scalar
    pad_value = np.min(marker).item()
    # Copy so the np.copyto below doesn't clobber the caller's marker
    image = marker.copy()
    expanded = image.copy()
    while True:
        expanded = cv2.dilate(
            src=image,
            dst=expanded,
            kernel=kernel,
            borderType=cv2.BORDER_CONSTANT,
            borderValue=pad_value,
        )
        expanded = np.fmin(expanded, mask)
        # Termination criterion: expansion didn't change the image at all
        if (image == expanded).all():
            return expanded
        np.copyto(dst=image, src=expanded)


def opencv_h_maxima(image, h=0.075, radius=2):
    # Assumes a float image (np.finfo); mirrors skimage's h_maxima setup
    resolution = 2 * np.finfo(image.dtype).resolution * np.abs(image)
    shifted_img = image - h - resolution
    reconstructed = opencv_reconstruct(shifted_img, image, radius)
    residue_img = image - reconstructed
    return (residue_img >= h).astype(np.uint8)
```
Test results vs scikit `h_maxima`:
Interpretation
As stated in the papers, the "straightforward" iteration approach is very slow for larger inputs. But it's ~10x as fast for 512x512.
(I'll put up a notebook shortly)
Open avenues:
Good citizen work?
Benchmark implementation / notebook: https://github.com/dchaley/deepcell-imaging/pull/17
My plan for the OpenCV implementation was to use dilation with a before/after footprint (excluding pixels as appropriate during the forward or backward scan). But I have no visibility into the iteration order; in fact I think the whole point is parallel execution, whereas FastHybrid needs a very specific iteration order.
But all hope is not lost 😅 The scikit grayreconstruct python implementation uses C under the hood to perform certain operations, after setting things up in numpy.
I believe it should be "easy" to create a similar .pyx iteration function, to perform fast-hybrid's raster scans, and defer back to python for the FIFO queue. Starting point: https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html
One concern is that if we start including C somewhere then it has to be super easy to build/deploy/install. Solving in scikit itself would be nice 🙏🏻 cause it'd be actually invisible to DeepCell (modulo a scikit version upgrade).
I implemented "fast hybrid" in normal Python. As expected it's a lot slower.
```
Image size: (512, 512)
scikit h_maxima: 0.15 s
opencv h_maxima: 0.01 s
OpenCV Speedup: 9.75x
scikit == opencv : True
custom python h_maxima: 9.62 s
Python speedup: 0.02x
scikit == custom python : True
```
However, the python code should (I hope) be very readily portable to Cython, at least the array-scan parts. This isn't the worst, as it's relatively common to distribute buildable Cython code.
https://github.com/dchaley/deepcell-imaging/pull/27 reimplements the fast-hybrid scan in Cython. As hoped, it's a LOT faster. 🎉
PR TLDR:
With this proof of concept speeding up h_maxima by an order of magnitude, I'm declaring this spike complete 😤 cc @lynnlangit
As noted from the PR there are at least 2 potential next steps:
The most expedient for our purposes is updating DeepCell, but it would be nice to contribute to scikit…
Followup issue here: https://github.com/dchaley/deepcell-imaging/issues/28
Mesmer postprocessing has 3 main steps: h_maxima, watershed, and fill_holes. (See: #9)
h_maxima happens in the deepcell-toolbox, which in turn calls scikit-image h_maxima, which lastly is largely implemented using gray reconstruction.
This ticket is to investigate optimization opportunities.