Adding CUDA support in PR#287

CielAl commented 3 months ago

Hi Andrew and Nan, @choosehappy @nanli-emory

since I am tired of running 1k+ WSIs for some paper over and over I add the CUDA support (#287) into the WSIHandles using RapidsAI's cucim.CuImage, and here goes some thoughts, benchmarks, and todos:

Motivation and Feasibility (1) histoqc.wsihandle.CuImageHandle itself use cucim.clara.CuImage as the file handle, which intentionally mirrors a subset (and is expanding) of openslide's API as described in their documentation (under development). It uses cupy.ndarray, which mostly mirrors the numpy.ndarray. cuCIM also by itself provides a mirror of skimage for most of common operations (e.g., filters, resize, color transformation etc.): cucim.skimage. These are pretty much the prerequisites of introducing the GPU acceleration without breaking the interface and compatibility of our code and cuCIM seems to be a good candidate.

(2) HistoQC's pipeline is mainly bottlenecked by downsampling (thumbnails) and different sets of filtering options. The thumbnail extraction procedure is occasionally dominated by the read_region which decode the compressed data from tiled pyramid data. I did a few simple benchmarks and here is the result, using a 40x WSI with size 46407x53784.

Downsampling using the logic of get_thumbnails (i.e., find and read the closest level then resize) is about 9~10x faster with CUDA (RTX Titan) than using CPU (openslide).
Downsampling using the resizeTileDownward is about 6x faster with GPU.
Gaussian Filtering over a 2048x2048 region with a sigma of 5 is about 300x faster with GPU acceleration than the conventional CPU operations from skimage.transforms.

(3) Since cupy.ndarray pretty much mirrors most of the essential API of numpy.ndarray, and the conversion between each other is trivial and simple, and there exists high-level function/transformations that shares the same interface (e.g., cupyx replaces scipy and cucim.skimage replaces skimage), this would also be esier to introduce GPU acceleration into custom modules with careful planning.

(4) We could definitely list cucim as an extra_requirement in toml and setup.py so it can be optional, without breaking the current pipeline.

Changes (1) We might need to explicitly define, differentiate and backend data type. For instance, openslide mostly returns PIL.Image.Image, and cucim.CuImage returns cucim.CuImage itself. Since cucim.CuImage is a centerpiece of most operations inside CuImageHandle yet not sharing any interfaces with PIL.Image.Image, a different layer of abstraction might be needed in WSIImageHandle to differentiate the different backend data type the handle returns for internal procedures and the unified output types to external procedures. Herein, a method of backend_region is added, which returns the raw output of reading a single region from the handle. While the read_region read the backend_region and explicitly unified them to PIL.Image.Image using the backend_to_pil.

(2) We might need to explicitly manage the array data type beneath the backend data, mostly for external transformations. The array data in many cases share the same memory of the backend data (e.g., the numpy.ndarray from the PIL.Image.Image and cupy.ndarray from cucim.CuImage) and is trivial to retrieve. This procedure is defined by backend_to_array method. The array data should share similar interfaces and methods that exposes and returns such array data would allow the introduction of GPU operations to external procedures.

For instance, getBestThumb returns np.ndarray for OpenslideHandle and cupy.ndarray for CuImageHandle, which allows downstream operations from external modules to identify whether GPU accelerated transformations should be applied.

(3) The resizeTileDownward and getBestThumb methods seem to be tightly coupled to the image handles rather than BaseImage, and with the modifications from (1) and (2) above, they are integrated into the WSIImageHandle as generic procedures. Users only need to override methods such as region_backend, backend_to_array, etc. without modifying the resizeTileDownward and getBestThumb themselves. (Benchmark was done via the integrated resizeTileDownward and getBestThumb.

TODO (1) Image operations with GPU acceleration in custom modules using cucim.skimage and cupyx. (2) Vendor info. So far cucim.CuImage simply lumps vendor-specific metadata (e.g., mpp, mag etc.) into its metadata field, and it lacks a unified parser to read these meta data into a map of common property names like what openslide is doing. Currently cuCIM does not offer more detailed documentation regarding the metadata and as a workaround, a dummy openslide.OpenSlide is associated into the CuImageHandle only for reading properties.

choosehappy commented 3 months ago

A pile of work here, very impressive!

My understanding was that Cucim previously only supported a couple of WSI formats - in particular those with flat tiff like structures. is that still the case? I'm personally interested in mrxs formats since there appear to be tons of these floating around, but i suspect some of the ndpi ones are also very widely used.

does Cucim now support them?

i'd point out as well that the latest version of openslide now perfectly supports dicom - which is a huge must have since this appears to be the format that folks will be using clinically going forward (to be compatible with their PACS systems). How would that format fit in here as well?

if those check out - last question, what do you imagine the integration to look like? supporting a CPU + GPU versions separately? there are many installations of histoqc which currently reside in CPU only environments, so we won't be able to eliminate that option likely ever. Would we instead have CuCim run on CPU? will there be a negative penalty for that?

adding @jacksonjacobs1 as senior developer

CielAl commented 3 months ago

A pile of work here, very impressive!

My understanding was that Cucim previously only supported a couple of WSI formats - in particular those with flat tiff like structures. is that still the case? I'm personally interested in mrxs formats since there appear to be tons of these floating around, but i suspect some of the ndpi ones are also very widely used.

does Cucim now support them?

i'd point out as well that the latest version of openslide now perfectly supports dicom - which is a huge must have since this appears to be the format that folks will be using clinically going forward (to be compatible with their PACS systems). How would that format fit in here as well?

if those check out - last question, what do you imagine the integration to look like? supporting a CPU + GPU versions separately? there are many installations of histoqc which currently reside in CPU only environments, so we won't be able to eliminate that option likely ever. Would we instead have CuCim run on CPU? will there be a negative penalty for that?

adding @jacksonjacobs1 as senior developer Hey thanks for the reply and here are my a few thoughts.

Format support and how does it work with current openslide-based implementation.

I looked into the documentation and as the time being cuCIM does support multi-resolutional tiled images but currently it doesn't mention the support of mrsx and here is the current list of supported format:

Aperio ScanScope Virtual Slide (SVS)
Philips TIFF
Generic Tiled, Multi-resolution RGB TIFF files with the following compression schemes:
- No Compression
- JPEG
- JPEG2000
- Lempel-Ziv-Welch (LZW)
- Deflate

Given the current development status of cuCIM I would expect more updates in future and therefore it makes sense to me to simply list the package cucim as an extra requirements (so you install with smth like pip install histoqc[cucim]), and openslide is still the best option for the most generic support and compatibility. We'd also make CuImageHandle a developing branch or an experimental feature as a lot may depend on future updates of cuCIM. (A brief disgression here: since openslide 4.0.0 may not handle well on all BigTIFF typed images in certain cases, we might need a TiffSlide-based handle specifically handling special tiff formats)

Integration with CPU code bases

That will bring us to your final question: how do we do the integration. First, as you said, the majority still use CPU-only environment and we can't forsake that. Meanwhile, I think it is easy to handle both CPU and GPU without breaking the backport compatibility for the below reasons.

(1) We can hide as much detail of cuCIM-related code behind the unified interface of WSIImageHandle. The pipeline and custom modules might need be aware of whether an incoming data type is numpy.ndarray or cupy.ndarray but nothing more. In fact, most of the semantic coupling cuCIM introduced for now is that the GPU acceleration relies on cupy.ndarray rather than numpy.ndarray - an array interface but device/driver dependent.

That means, we need to clearly define and regulates the return type of different set of interfaces such that:

Some are for external use and generic purposes - must be numpy.ndarray (the most generic) or PIL.Image.Image (generic but with more higher-level functionalities). read_region is an example.
Some are the direct type the file handle fetches/returns (e.g., openslide returns PIL.Image.Image and cucim returns CuImage). I define these types are the backend type of the image handle and one interface is region_backend, which you may consider as an internal interface of read_region and is mostly used in scenario wherein the context is aware of the handle type.
What is the data type of array format of the implemented WSIImagehandle, e.g., numpy.ndarray or cupy.ndarray etc. And what is the protocol to explicitly convert the backend type (e.g., PIL or CuImage) to the underlying array (np.ndarray or cupy.ndarray) for computations. (e.g., WSIImageHandle.backend_to_array)
Internal arithmetic operations and transformations so that the WSIImageHandle can mimic what an openslide.OpenSlide can do, e.g., getThumbnails etc. For instance, we might integrate the rgba2rgb (which seems to be scrapped from the openslide itself) into WSIImageHandle and let the specific handle decide how to blend the background --- For CuImage it doesn't care about alpha so it's always RGB. And we can integrate a resize function which getThumbnails definitely uses: the numpy.ndarray can be resized by using a PIL.Image wrapper (faster than skimage) and a cupy.ndarray can be resized by cucim.skimage.resize etc.

We currently integrate the getBestThumb and resizeTileDownward into WSIImageHandle, wherein the only thing differs are what data interface and resize functions were implemented by subclasses.

What interfaces returns array data type (numpy.ndarray, cupy.ndarray, etc) wherein external modules might use. E.g., we can discuss whether the getBestThumb here always returns a np.ndarray or it can be either.

GPU Acceleration into Modules

Three important things to note are: (1) cucim library does not provide APIs that directly transform cucim.CuImage itself. Rather, it works on the cupy.ndarray you obtained from the cucim.CuImage. (2) cucim.CuImage supports CPU! And in fact depends on the device attribute you can obtain a numpy.ndarray or cupy.ndarray from it. (3) A significant bottleneck that throttles the speed in CPU is actually openslide.OpenSlide.read_region, especially in thumbnail computation (wherein a large and closest level is read and resized as a whole in most of cases).

Based on (1) and (2) we actually have two options:

Option 1: we can make modules completely type-of-WSIImageHandle-independent -- it only follows the instruction of config.ini (gpu_support=1 or 0), and it converts data to cupy and uses corresponding functions by itself, regardless of whether CuImage is used. The good thing is that this way the GPU acceleration of modules are not coupled to the WSIImageHandle, so long as dependencies such as cupy, cupyx (cupy's scipy), and cupyimage (one of the options of cupy's skimage, other than cucim.skimage) etc installed.
Option 2: We want the GPU computation to be coupled with WSIImageHandle as they are semantically connected. In this case, remember that we defined in the WSIImageHandle which interfaces shall return array data for computation and the module simply rely on the return type of these interfaces.

Regardless of your options, you may need a dynamic import of function call or wrappers defined to handle numpy.ndarray and cupy.ndarray in module operations, and perhaps a Protocol class of array interface for typing purposes to unify numpy.ndarray and cupy.ndarray for more smooth coding.

I am inclined to Option 2 because

the module's cuda acceleration and the utilization of CuImage is really semantically connected and it makes sense.
you need to define either the dynamic import or unified function interfaces of these array transformations anyway, so any spaghettis about checking numpy.ndarray vs cupy.ndarray really goes into these functions and the modules themselves should never care about that.

CielAl commented 2 months ago

@nanli-emory @choosehappy @jacksonjacobs1 Since dicom_support is no longer an active branch, I peel off the necessary changes, wrap up the ImageHandle abstraction and corresponding deployment in BaseImage and QC modules, merging all changes to the main (in #290).

Involvement of WSIHandle as a standalone attribute in BaseImage could be a first step to clean the current hard-to-refactor BaseImage (with its swamps of str keys and logging msgs), and would be beneficial to all future new program features.

A few points:

(1) Below I prepare a scratch of scripts to compare the difference of methods under skimage and cucim.skimage, excluding test packages and private/protected methods. cucim.skimage mirrors 266 methods of skimage while misses 366 methods. But the good news is nearly all methods we need for QC procedures are in the 266 implemented ones. A quick glance show that the 366 unimplemented methods are mostly either io/draw/viewer related, or merged to implemented ones (e.g., filters.rank merged into filters as a generic filter module). There are very few necessary methods that are currently not implemented, e.g., graycomatrix. That seems to make it feasible to simply implement unified functional interfaces for skimage/scipy/numpy operations based on runtime check of input array type (numpy.ndarray vs. cupy.ndarray). And if a particular operation does not benefit from GPU acceleration or the CUDA-counterpart is not implemented yet, we downgrade it to numpy.ndarray temporarily and call the corresponding CPU-based approaches.

(2) It seems like we should prepare to moveforward the major python version from 3.7--3.8 to 3.8--3.10, as more packages are migrated to python3.10 and 3.7 is retiring. A few skimage methods are also obsolete and might not return numerically accurate results (see the issue of frangi). This means we might need to consider upgrading the corresponding packages as well. Hence, The function calls/signature of certain skimage methods were updated so it will be ready for upgrading (remove deprecating signatures etc.)

(3) Since the image handle is coupled with nearly all custom QC modules, all these modules are revised for the change and implementation of WSIImageHandle, along with lots of cleanings. TBH the implementation of Modules and BaseImage itself is kinda dirty, especially when the getitem is overused in custom modules and it would be a burden for refactoring even with a modern IDE.

import importlib
import pkgutil
from types import ModuleType

def find_public_methods(module):
    return [name for name in dir(module) if not name.startswith('_') and callable(getattr(module, name))]

def method_report(full_method_name: str, method_name: str,
                  present_methods: list, missing_methods: list, misc_errors: list):
    # Construct corresponding cucim.skimage method name
    cucim_method_name = full_method_name.replace('skimage', 'cucim.skimage', 1)
    cucim_module_name = ".".join(cucim_method_name.split(".")[:-1])
    try:
        cucim_module = importlib.import_module(cucim_module_name)
        result = getattr(cucim_module, method_name)
        present_methods.append(cucim_method_name)
    except (ImportError, AttributeError):
        # If the method does not exist in cucim.skimage, add to missing_methods list
        missing_methods.append(full_method_name)
    except Exception as e:
        print(e, full_method_name)
        misc_errors.append(full_method_name)

def module_report(name: str, present_methods: list, missing_methods: list, missing_modules: list, misc_errors: list):
    try:
        submodule = importlib.import_module(name)
    except ImportError:
        missing_modules.append(name)
        return
    if isinstance(submodule, ModuleType):
        methods = find_public_methods(submodule)
        for method in methods:
            full_method_name = f"{name}.{method}"
            method_report(full_method_name, method, present_methods, missing_methods, misc_errors)

def traverse_modules(package_name):
    package = importlib.import_module(package_name)
    missing_methods = []
    missing_modules = []
    present_methods = []
    misc_errors = []
    for loader, name, is_pkg in pkgutil.walk_packages(package.__path__, package.__name__ + '.'):
        # Skip the skimage.data submodule entirely
        if ("skimage.data" in name or any([x.startswith('_') for x in name.split('.')])
                or "tests" in name.split('.')):
            continue

        if is_pkg:
            module_report(name, present_methods, missing_methods, missing_modules, misc_errors)
            continue

        full_method_name = name
        method_name = name.split(".")[-1]
        method_report(full_method_name, method_name, present_methods, missing_methods, misc_errors)
        # if not pkg --> method

    return missing_methods, missing_modules, present_methods, misc_errors

# Usage
missing_sk_methods, missing_sk_modules, present_sk_methods, misc_sk_errors = traverse_modules('skimage')
print("Missing Methods:")
print(missing_sk_methods)
print("Missing Modules:")
print(missing_sk_modules)

print("Present Methods:")
print(present_sk_methods)

print("Misc Errors:")
print(misc_sk_errors)

choosehappy / HistoQC