Closed acycliq closed 11 months ago
How much memory do you have on your GPU and what is the size of the image you are processing. Maybe it is an out of memory error?
If an out of memory error you can try to call the deconvolution with dask. See this script for an example https://github.com/clij/clij2-fft/blob/master/python/clij2fft/test_richardson_lucy_dask.py
Yes indeed! You are correct, it is a memory issue. Unfortunately I have to work with an outdated RTX2060, 6GB. I have a 3d image with dims 60 planes by 2300-2300px and I tried your implementation on the mid part of it, ie selected the 20 planes in the middle of the stack. I works quite nice and it is fast.
Apologies for my ignorance, would it make sense to make 3 thinner images, each of dimension 20planes by 2300-2300px, feed them to the richardson_lucy function and then combine them at the end?
Unfortunately I wasnt able to run succesfully the non circular function, I am getting the same error, I assume it is my gpu memory
Many thankx!
Hi @acycliq
I just added an easier way to access the dask (chunked) version of richardson_lucy_nc see here.
To get access to this you will have to update clij2-fft by running
pip install clij2-fft
Note that clij2-fft recently moved from the test pypi, to the main pypi so the installation command is slightly different (and easier) now. You should get version 0.23 (the latest).
Then you can call
from clij2fft.richardson_lucy_dask import richardson_lucy_nc_dask
decon=richardson_lucy_nc_dask(im, psf_decon, numiterations=100, regularizationfactor=0.005, x_chunk_size=512, y_chunk_size=512, overlap=24)
Where you set x_chunk_size, and y_chunk size to values small enough to handle your image. Maybe try 512 and work down from there if it still doesn't work.
I don't usually chunk in z because the PSF is often elongated in z, and chunking in the z dimension can lead to artifacts.
Hi Brian
Thanks so much! In the meanwhile I have been experimenting on my side too. I almost ended up with something along the lines
chunk_size = (60, 2304/2, 2304/2) # my image is 60 by 2304-2304px
arr = da.from_array(img, chunks=chunk_size)
def deconv(arr):
return richardson_lucy(arr, psf, 50, 0.002)
rl = arr.map_overlap(deconv, depth=(6,6,6), boundary='reflect', dtype='float32').compute(num_workers=1)
I will post a more complete example when I finish, maybe it will help some others in the community but your update maybe makes that redundant. Anyways, I now playing around with the dask_array.map_blocks
function.
I didnt chunk the z too, I just split in half both x and y and my gpu processes it with out problems. Btw how can I flush my gpu, I have been experimenting with different chunk sizes and once I send too big chunks to be processed then the next calls fail despite the new chunks are a lot smaller.
Also, the num_workers in the compute call above, setting that to a higher number would only be meaningfull in a multi-gpu system, do I get that correct please? I tried with 2 workers and it fails.
Thanks again! I have to say its a really nice lib, runs so fast! It is convenient too it doesnt require cuda, hence can be run on wider spectrum of configurarions if I understad well (apple M1 for example, altthough Im not using any....)
If you get a chance you could add to this discussion.
I am not sure how to reset the GPU. I will have to check the opencl API to see if there is anything. I will also check over the c++ part of the code and make sure I am cleaning up everything after an error occurs.
My understanding is that 2 workers should use 2 GPUs, but I do not have a multi-GPU system so haven't tested this. When I get a chance I'll start up an AWS VM with multiple GPUs to test this.
In addition to running on M1 the clij2-fft implementation of RL has total variation regularization noise handling, and non-circulant edge handling, both of which can be very important in some situations.
Ok yes sure, I will add to that post, thanks for the link.
At the moment I am flushing the gpu manually, ie on the terminal I do:
sudo fuser -v /dev/nvidia*
then I find the pid associated with python and i kill it
sudo kill -9 PID
(Thats on Linux, I do not know on WIndows...)
I asked because I was thinking to send the full image first to the deconvolution algo (most ppl have better gpus than I do....) and if it fails then to catch it in a try/catch loop and feed it by chunks. But I would have to flush my gpu memory after the exception. I imagine it would run faster in one go than in chunks
Anyways, thanks once more, you helped me massively. I keep this open, I will close it when I post my coding to the discussion thread you pointed me to, (I think tomorrow...)
Another approach would be to check how much memory is available on the GPU, then compute the chunk size from there. It is apparently a bit tricky to get live memory metrics with opencl, but we can at least get the amount of memory on the card. Something like
import pyopencl as cl
platforms = cl.get_platforms()
devices=platforms[0].get_devices()
for device in devices:
print(device)
print(device.get_info(cl.device_info.GLOBAL_MEM_SIZE))
Then base chunking size on card memory. The NCRLTV algorithm needs apr. 9 copies of the processed image size. The image that is processed will be (imagesize + psfsize/2 + overlap) in each dimension. I probably will make a small utility for this, but won't get around to it until next week.
Hi Brian
Thanks for the code above about checking the memory. I have put together this one
import numpy as np
import skimage
from clij2fft.richardson_lucy_dask import richardson_lucy_nc_dask
from clij2fft.richardson_lucy import richardson_lucy, richardson_lucy_nc
import tifffile as tif
import os
import dask.array as da
import pyopencl as cl
def gpu_mem():
platforms = cl.get_platforms()
devices = platforms[0].get_devices()
mem_size = []
for device in devices:
mem_size.append(device.get_info(cl.device_info.GLOBAL_MEM_SIZE))
return min(mem_size)
def rl_mem_footprint(img, psf, depth=(0, 0, 0)):
img_size = np.array(img.shape)
psf_size = np.array(psf.shape)
depth_size = np.array(depth)
total_size = np.prod(img_size + psf_size / 2 + depth_size)
img_bytes = total_size * 4 # float32 needs 4 bytes
img_bytes = img_bytes * 9 # the RL deconv algo needs 9 copies
return np.ceil(img_bytes)
def chunk_factor(img, psf, depth):
"""
If chunks are needed, the 3D image will be chunked along x and y only because as
Brian said the psf could be elongated along z and chunking it on z could create artifacts.
I will be chunking the image in such a manner that I will be creating 4 or 16 or 32 etc
chunks. That means that if I split x by 2 then y will also be split by 2 which
will result in 4 chunks with the same aspect ratio (on xy) as the original image.
Similarly, to get 16 chunks (if needed) I will be splitting both x and y by 4
The image x,y dimensions are assumed to be even-numbered,
ie they can be divided by 2. No check is done for this
"""
gpu_bytes = gpu_mem()
img_bytes = rl_mem_footprint(img, psf, depth)
cf = 1
if gpu_bytes <= img_bytes:
# we wamt to find an integer k such that:
# img_bytes / 4^k <= gpu_bytes
# inflate by 1 byte so that if img_bytes==gpu_bytes then chunks will be produced.
img_bytes = img_bytes + 1
k = np.ceil(np.emath.logn(4, img_bytes/gpu_bytes))
# 4^k is the number of chunks.
# The find how much x and y must be split-by take the square root
cf = np.sqrt(4 ** k)
return cf
def decon_func(psf, algo, iter, regularisation):
if algo == "richardson_lucy":
def decon(img):
return richardson_lucy(img, psf, iter, 0.0)
elif algo == "richardson_lucy_tv":
def decon(img):
return richardson_lucy(img, psf, iter, regularisation)
elif algo == "richardson_lucy_nc":
def decon(img):
return richardson_lucy_nc(img, psf, iter, regularisation)
else:
raise ValueError('Not implemented')
return decon
def deconvolver(img, psf, algo, iter, regularisation):
depth = (0, 6, 6) # hardcoding this, maybe not good idea. O for z since no chunking there, depth on x, y is arbitrary i think??
k = chunk_factor(img, psf, depth)
print(k)
assert k % 2 == 0, "chunking can be done in multiples of 2 only"
chunk_size = (img.shape[0], img.shape[1]/k, img.shape[2]/k)
print('chunk size: %s' % list(chunk_size))
# chunked dask array
img = da.from_array(img, chunks=chunk_size)
print(regularisation)
rl = img.map_overlap(decon_func(psf, algo, iter, regularisation),
depth=depth,
boundary='reflect',
dtype='float32').compute(num_workers=1)
return rl
imgName = r"/path/to/some/image.tif"
psfName=r"path/to/some//psf.tif"
img=skimage.io.imread(imgName)
psf=skimage.io.imread(psfName)
rl = deconvolver(img, psf, "richardson_lucy_tv", iter=50, regularisation=0.0002)
rl[rl >= 2**16] = 2**16-1 # saturate values >= 65536
tif.imwrite('richardson_lucy_uint16.tif', rl.astype(np.uint16), bigtiff=True)
When you got some time, let me know if it makes sense, if I got the memory calculations correct etc...
Thanks
I took the utility functions you wrote and wrapped them into to the code base. See https://github.com/clij/clij2-fft/pull/23
Let me know if you have any other suggestions before I merge this.
Brian
Thanks for accepting them. I spotted some typos:
line 57: change 32
to 64
line 87: change wamt
to want
line 96: change The find
to To find out
I dont have anything more to add. You can also close the issue. I guess there is no point to add anything to the dicussion thread at the image.sc forum. Your new release will cover that.
Many thanks for this library. I am getting an error, I imagine it is a long shot but when I try to call
richardson_lucy_nc
I receiveI do get a return array back but the image looks almost black, very low intensities.
Also when I try
richardson_lucy(img. psf, iterations, regularization_factor)
(this is the total variation algo right?), or the standard Richardson-Lucy (ie without the regularization factor) I receiveIn this case the return image, looks quite the same as the input image
The Test_Installation.py script work fine
Any idea what could have gone wrong please? I can share the data if that helps..
Thank you!