cwlkr / torchvahadane

GPU-accelerated Vahadane stain normalization for digital pathology workflows.
MIT License

How to parallelize? #4

Open BlueSpace-ice opened 4 months ago

BlueSpace-ice commented 4 months ago

I'm sorry to bother you, but is it possible to use the graphics card to process multiple slides in parallel? Using a for loop takes too long, and an unknown error occurred when I tried `from multiprocessing import Pool`, so I don't have any other options. Thank you!

CielAl commented 4 months ago

I have the same observation, but I think the challenges would be: (1) Individual inputs take different numbers of steps to converge, so a different variant of the ISTA algorithm might be needed (so far torchvahadane implements ISTA and FISTA), or at least the step size would have to be managed based on the losses of the whole batch of images.

(2) The tissue masking: essentially, Vahadane performs the dictionary learning on the tissue pixels of each image, so the dimensionality of the actual input to the dictionary learning varies from image to image, depending on the tissue region.

I would recommend simply caching the stain matrices of any images that may be reused, to avoid recomputation, and/or using a faster approach to obtain the stain concentrations from the OD values and stain matrices (e.g., a least-squares solver such as torch.linalg.lstsq) if time efficiency is a specific need.

An example of using least squares to solve for the concentrations is attached here, derived from @cwlkr's code.
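
(The attachment itself is not reproduced here; below is a rough, untested sketch of that least-squares step, assuming flattened RGB tissue pixels, a precomputed 2x3 stain matrix whose rows are the H and E stain vectors, and a background intensity of 255. The helper names are illustrative and not part of torchvahadane's API.)

```python
import torch


def od_from_rgb(pixels: torch.Tensor, background: float = 255.0) -> torch.Tensor:
    """Convert (N, 3) RGB pixel values to optical density via Beer-Lambert."""
    pixels = pixels.float().clamp(min=1.0)  # avoid log(0)
    return -torch.log(pixels / background)


def concentrations_lstsq(od: torch.Tensor, stain_matrix: torch.Tensor) -> torch.Tensor:
    """Solve OD ~= C @ stain_matrix for the concentrations C with a least-squares solver.

    od:           (N, 3) optical density values, one row per pixel.
    stain_matrix: (2, 3) rows are the H and E stain vectors.
    Returns C:    (N, 2) stain concentrations per pixel.
    """
    # torch.linalg.lstsq solves A X = B; here A = stain_matrix.T (3, 2) and
    # B = od.T (3, N), so X is (2, N) and we transpose back to (N, 2).
    solution = torch.linalg.lstsq(stain_matrix.T, od.T).solution
    return solution.T
```

This skips the sparse (lasso) concentration step of the original Vahadane formulation, so it trades some sparsity for speed; whether that is acceptable depends on the downstream use.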

cwlkr commented 1 month ago

Hello,

Unfortunately, this is not really feasible, at least not in a straightforward manner.

As far as I understand, the problem lies in CUDA itself. CUDA accelerates a task by splitting it into many small steps that run in parallel on GPU kernels, whereas CPU parallelization runs the same task for different inputs in parallel. Because of how CUDA is constructed, the two are not easily mixed. As far as I know, this is more related to shared-memory issues than to the optimization algorithms themselves.

There might be a fix nowadays with torch.multiprocessing, but I lack the time at the moment to investigate this further.
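
(As an untested sketch of what such a torch.multiprocessing approach could look like: one spawned process per shard of tiles, each with its own CUDA context and its own normalizer. The TorchVahadaneNormalizer name and its fit()/transform() methods are assumed from the repo's README, and the file handling is placeholder code.)

```python
import torch
import torch.multiprocessing as mp

# Assumption: class name and fit()/transform() follow the torchvahadane README;
# treat them as illustrative if the API differs in your version.
from torchvahadane import TorchVahadaneNormalizer


def worker(rank: int, world_size: int, image_paths: list, target_path: str):
    """Normalize a round-robin shard of tiles in its own process/CUDA context."""
    import cv2  # imported inside the worker so it loads after the process spawns

    device = f"cuda:{rank % torch.cuda.device_count()}"
    normalizer = TorchVahadaneNormalizer(device=device)
    target = cv2.cvtColor(cv2.imread(target_path), cv2.COLOR_BGR2RGB)
    normalizer.fit(target)

    for path in image_paths[rank::world_size]:
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        out = normalizer.transform(img)
        if isinstance(out, torch.Tensor):  # move to CPU/numpy if the library returns a tensor
            out = out.cpu().numpy()
        cv2.imwrite(path.replace(".png", "_norm.png"),
                    cv2.cvtColor(out, cv2.COLOR_RGB2BGR))


if __name__ == "__main__":
    paths = ["tile_000.png", "tile_001.png"]  # placeholder inputs
    world_size = 2
    # 'spawn' is required for CUDA; each child process builds its own CUDA context.
    mp.spawn(worker, args=(world_size, paths, "target.png"), nprocs=world_size)
```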

If it's a training situation, setting num_workers to 16 or more usually still results in good GPU utilization, as the forward pass + backprop can take longer than the (parallelized) image normalization/augmentation anyway.

For this, I have seen that with SPAMS it is better to set numThreads=1 and use a higher num_workers, as creating new threads all the time can be slow.
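
(A minimal sketch of that training setup, assuming a fitted CPU-side normalizer object with a staintools-style transform() method; the class and variable names are illustrative. The point is that the normalization runs inside the Dataset, so the DataLoader workers parallelize it.)

```python
import cv2
import torch
from torch.utils.data import Dataset, DataLoader


class NormalizedTileDataset(Dataset):
    """Applies stain normalization in __getitem__ so DataLoader workers run it in parallel."""

    def __init__(self, paths, normalizer):
        self.paths = paths
        self.normalizer = normalizer  # already fitted, CPU-side normalizer

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = cv2.cvtColor(cv2.imread(self.paths[idx]), cv2.COLOR_BGR2RGB)
        img = self.normalizer.transform(img)  # runs inside the worker process
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0


# High num_workers hides the normalization cost behind the forward/backward pass:
# loader = DataLoader(NormalizedTileDataset(paths, normalizer),
#                     batch_size=32, num_workers=16, pin_memory=True)
```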

CielAl commented 1 month ago

It might be possible in a multi-GPU scenario where each GPU uses its own process (e.g., dask-cuda), but that's up to how users create their own workflows rather than something a stain normalization toolkit should resolve.