clEsperanto / pyclesperanto_prototype

GPU-accelerated bio-image analysis focusing on 3D+t microscopy image data
http://clesperanto.net
BSD 3-Clause "New" or "Revised" License

Not working within multiprocessing #144

Open DirkRemmers opened 2 years ago

DirkRemmers commented 2 years ago

Hi all,

First of all I want to say what a super nice package this is, thanks for the hard work! Second, I am not sure whether this is an issue on the pyclesperanto side or in the multiprocessing package, but I thought it would be good to report it here in any case.

At the moment I'm trying to set up a high-throughput image analysis pipeline. Some of the functionality needs CPU processing, while other steps should use the GPU. To speed up the CPU processing, I would like to use multiprocessing to make full use of my workstation. However, I might have found an issue while trying to design a script that uses both at the same time.

When I have the GPU processes outside of the multiprocessing functionality, all works nicely. A simplified example:

from multiprocessing import Pool
from skimage import filters
import numpy as np
import pyclesperanto_prototype as cle

def image_analysis_CPU(image):
    # do something with the image
    blurred_image = filters.gaussian(image, sigma=2)
    return blurred_image

# get a list of images called image_list and continue with the CPU processes
image_list = [np.random.random((1024, 1024)) for _ in range(8)]  # example data

n_cores = 4  # example
pool = Pool(n_cores)
result_list = pool.map(image_analysis_CPU, image_list)
pool.close()
pool.join()

# then loop over the result_list and do all GPU processes
# notice that if there are many images, the memory requirements of result_list become very high
cle.select_device('RTX')
gpu_result_list = []
for result in result_list:
    gpu_input = cle.push(result)
    gpu_result = gpu_input  # placeholder: do my image analysis to get gpu_result
    analysis_result = cle.pull(gpu_result)
    gpu_result_list.append(analysis_result)

# continue with the rest of the script

However, due to the file sizes I'm going to process, I am afraid this solution is not scalable, since it would require storing a lot of data before the GPU processing starts. The data that the GPU process needs is relatively large, but it can be discarded once the GPU process is finished (I'm only interested in its result). After the GPU process has run I can save my data efficiently, but saving the GPU input is not efficient and would require some inelegant data handling afterwards. Ideally, I would like to streamline my script so that I can use the GPU functionality inside a Pool worker's process. This way, I can avoid storing lots of data that is only needed for the GPU processing. I included a lock to prevent multiple workers from trying to access the GPU at the same time. A simplified example:

from multiprocessing import Pool, Lock
from skimage import filters
import numpy as np
import pyclesperanto_prototype as cle

def image_analysis(image):
    blurred_image = filters.gaussian(image, sigma=2)
    lock.acquire()  # to prevent multiple processes trying to access the GPU at the same time
    cle.select_device('RTX')
    gpu_input = cle.push(blurred_image)
    gpu_result = gpu_input  # placeholder: do my image analysis to get gpu_result
    analysis_result = cle.pull(gpu_result)
    lock.release()
    return analysis_result

def init(l):
    global lock
    lock = l

def main():
    # get a list of images called image_list and continue
    image_list = [np.random.random((1024, 1024)) for _ in range(8)]  # example data
    l = Lock()  # define the lock outside the Pool so that all workers share the same one
    n_cores = 4  # example
    pool = Pool(n_cores, initializer=init, initargs=(l,))
    result_list = pool.map(image_analysis, image_list)
    pool.close()
    pool.join()
    # continue with the rest of the script

if __name__ == '__main__':
    main()

The issue I get with this approach, though, is that any pyclesperanto call causes the Pool worker to freeze; it cannot get past any of the cle functions. Even selecting a GPU via the cle.select_device('RTX') line does not work. There are no error messages for me to include in this issue; the script just seems to get stuck as soon as it tries to perform a pyclesperanto-related function.

Obviously the example code is drastically simplified, otherwise I could do the entire thing on the GPU. If a full script is required, please let me know and I can send all the example files.

Thanks in advance, I'm looking forward to your response.

haesleinhuepf commented 2 years ago

Hi @DirkRemmers ,

interesting! I don't have much experience with multiprocessing.Pool, but I would be happy to dig into this a bit as it could be relevant for my image analysis work as well. For that, it would be good to have a fully working example. Your code is really good and almost there, but unfortunately I can't "just" run it. Example data could be something like np.random.random((1024, 1024, 100)). Feel free to upload a jupyter notebook somewhere on github that I can clone, modify and (hopefully) send back a PR with useful feedback.
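
One thing that might be worth trying in the meantime (an untested sketch, not a confirmed fix): OpenCL state generally does not survive a fork(), so forcing the 'spawn' start method and creating the context inside each freshly started worker could behave differently. The device name 'RTX' and the gaussian_blur step are placeholders:

from multiprocessing import get_context
from skimage import filters
import numpy as np
import pyclesperanto_prototype as cle

def init_worker():
    # every freshly spawned worker creates its own OpenCL context
    cle.select_device('RTX')

def image_analysis(image):
    blurred_image = filters.gaussian(image, sigma=2)
    gpu_input = cle.push(blurred_image)
    gpu_result = cle.gaussian_blur(gpu_input, sigma_x=2, sigma_y=2)  # placeholder workflow
    return cle.pull(gpu_result)

if __name__ == '__main__':
    image_list = [np.random.random((1024, 1024)) for _ in range(8)]  # example data
    with get_context('spawn').Pool(4, initializer=init_worker) as pool:
        result_list = pool.map(image_analysis, image_list)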

Best, Robert

DirkRemmers commented 2 years ago

Hi Robert,

Thanks for the quick response! I made this repository on github for you to clone. Hopefully it is complete and clear enough to use; if not, please let me know what I can change to improve it.

I went ahead and tried another method for parallel processing, MPI4py. It creates its 'Pool workers' (called threads / ranks in MPI4py) in a way that makes them behave differently from the multiprocessing workers. Due to this difference, each rank can access the GPU, which allows me to use the ideal workflow mentioned in the first post of this issue.

Although I now have a solution for my specific problem, this does not solve the core issue with multiprocessing, so I think it would still be good if you looked into it. It may turn out to be a multiprocessing issue, but even then it would be good to document it for future users who have the idea of using multiprocessing together with pyclesperanto.

Cheers, Dirk

jackyko1991 commented 9 months ago

@DirkRemmers

I have tried similar parallel processing with a multiprocessing pool and with the pathos pool module.

With multiprocessing it complains about an incompatibility with pickle, which drove me to try pathos for the parallel run.
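
Side note: cle images wrap OpenCL buffers and, as far as I know, cannot be pickled, so a pattern that avoids this particular error is to let only numpy arrays cross the process boundary and do the push/pull entirely inside the worker (a sketch, with gaussian_blur as a placeholder workflow):

import numpy as np
import pyclesperanto_prototype as cle

def worker(image):
    # push/pull happen entirely inside the worker, so only numpy
    # arrays (which pickle cleanly) cross the process boundary
    gpu_input = cle.push(image)
    gpu_result = cle.gaussian_blur(gpu_input, sigma_x=2, sigma_y=2)  # placeholder workflow
    return cle.pull(gpu_result)  # numpy array, picklable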

Even though you can start a multithreaded run with pathos, cle exhibits severe blocking and does not speed up the analysis at all.

In your experience, is mpi4py the only solution that fixes this problem? The attached link is broken, so I cannot refer to your previous work. Would you mind uploading the code again?

DirkRemmers commented 9 months ago

@jackyko1991

Thanks for sharing your experience as well. I have not tried pathos myself, but it is good to know that it does not improve the situation. Of all the things I tried, only mpi4py has been successful. I'm not sure what happened to the link, but I'll try to find some time to write up another example for you. You'll see it appear here when it's done!

Cheers, Dirk

DirkRemmers commented 9 months ago

@jackyko1991

I had some time to get a quick example up and running. You can find the new repo here. This time I'll try not to break it 😄.

The code I put in there is a minimal use case showing how to use MPI4py. Its methods are much more advanced, and the code could most likely be optimized to minimize memory usage. If you find this method useful, please make sure to check out the MPI4py documentation for more advanced use cases.
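
The core of the pattern looks roughly like this (a simplified sketch, not the exact repo code; the device name 'RTX' and the gaussian_blur step are placeholders):

from mpi4py import MPI
import numpy as np
import pyclesperanto_prototype as cle

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    images = [np.random.random((1024, 1024)) for _ in range(8)]  # example data
    chunks = [images[i::size] for i in range(size)]  # one chunk of images per rank
else:
    chunks = None

# every rank receives its own chunk of images
my_images = comm.scatter(chunks, root=0)

cle.select_device('RTX')  # every rank opens its own OpenCL context
my_results = [cle.pull(cle.gaussian_blur(cle.push(image), sigma_x=2, sigma_y=2))
              for image in my_images]

# collect all results back on rank 0
all_results = comm.gather(my_results, root=0)

Run it with something like mpirun -n 4 python example.py.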

Cheers, Dirk

jackyko1991 commented 9 months ago


@DirkRemmers Great job! I have tested both Open MPI and MPICH in an Ubuntu environment. They both worked nicely.

Should we somehow integrate the MPI workflow into pyclesperanto as a batch processing example?