Error while running with --gpu mode

marcnol commented 5 months ago

Hey Erik, congratulations on the paper coming out! Great news ;)

I have installed the latest version (0.4) on our systems (Arch/Ubuntu). With some tweaking, the compilation with CL worked out fine. See the output of dw --version:

deconwolf: '0.4.0'
BUILD_DATE: 'Jun  7 2024'
FFT Backend: 'fftw-3.3.10-sse2-avx'
TIFF Backend: 'LIBTIFF, Version 4.6.0
Copyright (c) 1988-1996 Sam Leffler
Copyright (c) 1991-1996 Silicon Graphics, Inc.'
OpenMP: YES
OpenCL: YES
VkFFT: YES

Runs without the --gpu option work great.

However, when I try to run files with --gpu I get the following error:

$ Running dw --threads 20 --iter 50 --no-inplace scan_001_RT11_000_ROI_ch00.tif PSF_515.tif --overwrite --out scan_001_RT11_000_ROI_converted_decon_ch00.tif --gpu
Reading /mnt/grey/DATA/ProcessedData_2024/Experiment_49_David_RAMM_DNAFISH_proto_test/raw/input/scan_001_RT11_000_ROI_ch00.tif
Reading /mnt/grey/DATA/ProcessedData_2024/Experiment_49_David_RAMM_DNAFISH_proto_test/raw/input/PSF_515.tif
PSF Z-crop [181 x 181 x 219] -> [181 x 181 x 129]
Output: scan_001_RT11_000_ROI_converted_decon_ch00.tif(.log.txt)
Deconvolving using shbcl2 (using inplace)
image: [2048x2048x65], psf: [181x181x129], job: [2228x2228x193]
VkFFT failed with error 4039
On no... bad new from OpenCL:
errinfo: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 2080 Ti (Device 0).

Do you know if this is installation related?

Thanks again and great work!

BTW, here is the full log file for this run:

2024-06-07T10:06:53
> Settings:
image:  /mnt/grey/DATA/ProcessedData_2024/Experiment_49_David_RAMM_DNAFISH_test/raw/input/scan_001_DAPI_001_ROI_ch00.tif
psf:    /mnt/grey/DATA/ProcessedData_2024/Experiment_49_David_RAMM_DNAFISH_test/raw/input/PSF_460.tif
output: scan_001_DAPI_001_ROI_converted_decon_ch00.tif
log file: scan_001_DAPI_001_ROI_converted_decon_ch00.tif.log.txt
nIter:  50
nThreads for FFT: 20
nThreads for OMP: 20
verbosity: 1
background level: auto
method: Scaled Heavy Ball + OpenCL (SHBCL2)
metric: Idiv
Stopping after 50 iterations
overwrite: YES
tiling: OFF
XY crop factor: 0.001000
Offset: 5.000000
Output Format: 16 bit integer
Scaling: Automatic
Border Quality: 2 Minimal boundary artifacts
FFT lookahead: 0
FFTW3 plan: FFTW_MEASURE
Initial guess: Flat average
deconwolf: '0.4.0'
PID: 4056830
PWD: /mnt/grey/DATA/ProcessedData_2024/Experiment_49_David_RAMM_DNAFISH_test/raw/input
CMD: dw --threads 20 --iter 50 --no-inplace scan_001_DAPI_001_ROI_ch00.tif PSF_460.tif --overwrite --out scan_001_DAPI_001_ROI_converted_decon_ch00.tif --gpu
BUILD_DATE: 'Jun  7 2024'
FFT Backend: 'fftw-3.3.8-sse2-avx'
TIFF Backend: 'LIBTIFF, Version 4.1.0
Copyright (c) 1988-1996 Sam Leffler
Copyright (c) 1991-1996 Silicon Graphics, Inc.'
USER: 'marcnol'
HOSTNAME: 'darwin'
OpenMP: YES
OpenCL: YES
VkFFT: YES

Set the number of OMP threads to 20
Using static scheduling for OMP
Warning: TIFFTAG_SAMPLEFORMAT not specified, assuming uint but that could be wrong!
PSF Z-crop [181 x 181 x 219] -> [181 x 181 x 129]
Using fftw-3.3.8-sse2-avx with 20 threads
FFTW wisdom file: /home/marcnol/.config/deconwolf/fftw_wisdom_float_threads_20.dat
Importing FFTW wisdom
Deconvolving with shbcl2
image: [2048x2048x65]
psf: [181x181x129]
job: [2228x2228x193] (958048912 voxels)

elgw commented 5 months ago

Hello!

The error messages are a bit cryptic -- clearly a point for improvement.

However, it looks like there was not enough memory on the GPU. Try with a smaller image to know for sure. Typically dw with --gpu requires as much GPU memory as it needs RAM when --gpu is not specified. You can check the log files for that number; it is found at the end.
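For a rough sense of scale, here is a back-of-the-envelope sketch (not dw's actual accounting). It assumes that the job size is image + PSF - 1 along each axis, which matches the sizes printed in the log above, and that buffers hold single-precision floats (4 bytes per voxel), as the "float" in the wisdom file name suggests. The function names are made up for illustration, and how many such buffers dw keeps at the same time is not stated in this thread.

```python
from math import prod

BYTES_PER_VOXEL = 4  # assumption: single-precision float buffers


def job_shape(image, psf):
    """Padded job size: image + psf - 1 along each axis (matches the log)."""
    return tuple(i + p - 1 for i, p in zip(image, psf))


def buffer_gib(shape, bytes_per_voxel=BYTES_PER_VOXEL):
    """Size of one buffer of the given shape, in GiB."""
    return prod(shape) * bytes_per_voxel / 2**30


image = (2048, 2048, 65)
psf = (181, 181, 129)  # PSF size after the Z-crop reported in the log
job = job_shape(image, psf)
print(job, prod(job), f"{buffer_gib(job):.2f} GiB per buffer")
```

Under these assumptions a single full-size buffer takes about 3.57 GiB, so a few of them add up to far more than the size of the input file.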

The image can also be processed in tiles. Try for example --tilesize 1024; hopefully that will work. It is possible to compromise on the boundary handling with --bq 1 or even --bq 0. For some use cases that is OK, but it depends on the later image analysis steps in the pipeline.
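To see how the tile size affects the number of tiles, here is a small sketch. It assumes that tiles are laid out on a regular XY grid with edge length --tilesize, which is an assumption about dw's tiling rather than something documented in this thread. The helper name is hypothetical.

```python
from math import ceil


def n_tiles(width, height, tilesize):
    """Number of tiles on a regular XY grid (assumed layout)."""
    return ceil(width / tilesize) * ceil(height / tilesize)


for tilesize in (1024, 512, 128):
    print(tilesize, n_tiles(2048, 2048, tilesize))
```

For a 2048 x 2048 image this gives 4, 16 and 256 tiles. The last number agrees with the tile counter in the --tilesize 128 log further down.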

The --gpu option has been tested only with a few cards from Nvidia and AMD so far, so there might still be bugs to discover. There is a later version of VkFFT and I plan to upgrade to that one soon. That could potentially resolve some issues.

marcnol commented 5 months ago

Thanks for the prompt answer!

I am surprised that it is a memory issue, as the image is 590 MB and the GPU has 11 GB (we never had trouble deconvolving these images with Huygens, but the algorithm may be different!).

I tried --tilesize with 1024, 512, and now even 128, but in all cases it gets stuck (see the log below).

How much memory do the GPU cards you tested have? Ours is an NVIDIA GeForce RTX 2080 Ti.

Thanks again Erik!

Log for --tilesize 128:

-> Processing tile 29 / 256
Deconvolving using shbcl2 (using inplace)
image: [168x168x65], psf: [181x181x129], job: [348x348x193]
Iteration  50/ 50, Idiv=7.338e-01            
-> Processing tile 30 / 256
Deconvolving using shbcl2 (using inplace)
image: [168x168x65], psf: [181x181x129], job: [348x348x193]
Iteration  50/ 50, Idiv=9.189e-01            
-> Processing tile 31 / 256
Deconvolving using shbcl2 (using inplace)
image: [168x168x65], psf: [181x181x129], job: [348x348x193]
Iteration  50/ 50, Idiv=9.570e-01            
-> Processing tile 32 / 256
Deconvolving using shbcl2 (using inplace)
image: [168x148x65], psf: [181x181x129], job: [348x328x193]
Iteration  50/ 50, Idiv=7.720e-01            
-> Processing tile 33 / 256
Deconvolving using shbcl2 (using inplace)
image: [168x148x65], psf: [181x181x129], job: [348x328x193]
.On no... bad new from OpenCL:
errinfo: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_COPY_BUFFER on NVIDIA GeForce RTX 2080 Ti (Device 0).

Sorry! There was an unrecoverable error!
   File: /home/marcnol/Repositories/deconwolf/src/cl_util.c
   Function: fimcl_copy at line 301
   OpenCl error=CL_MEM_OBJECT_ALLOCATION_FAILURE

   CL_MEM_OBJECT_ALLOCATION_FAILURE indicates that there was not
   enough memory on the GPU to continue. Try with a smaller image
   and look up the option --tilesize

   If you are sure that OpenCL works on this machine
   and that it is a problem only related to deconwolf,
   check open issues or create a new one at
   https://github.com/elgw/deconwolf/issues
On no... bad new from OpenCL:
errinfo: CL_MEM_OBJECT_ALLOCATION_FAILURE error executing CL_COMMAND_COPY_BUFFER on NVIDIA GeForce RTX 2080 Ti (Device 0).

elgw commented 5 months ago

I use a 12 GB card at home (AMD 6700) and have never seen this issue... in the office we do all deconvolution on a 24 GB Nvidia 3080...

Tiles 32 and 33 have the same size and should require the same amount of memory, so this looks strange to me and indicates that I have a bug to hunt down (memory not released properly).

My guess is that --tilesize 512 should do just fine with your GPU, so there is no reason to go below that (I bet that the first tile is processed without problems).

I will investigate this next week and get back to you. Sorry for the poor first experience!

Cheers, Erik

marcnol commented 5 months ago

Hey,

I think I found the problem. At --tilesize 512 it is able to process 10-11 of the 16 tiles it needs to do.

When I monitor GPU usage, I see that the deconvolution of each sub-image uses only a small amount of memory, but the memory is not cleared between tiles, so it accumulates over time and ends up occupying all the available memory.
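A toy model of this accumulation, with placeholder numbers chosen only for illustration (they are not measurements from this run): every tile needs a working set that fits easily, but a fixed amount is left behind after each tile, so the run fails partway through. The function and parameter names are hypothetical.

```python
def first_failing_tile(capacity_mib, working_set_mib, leak_per_tile_mib, n_tiles):
    """Return the 1-based index of the first tile that no longer fits, or None."""
    leaked = 0
    for tile in range(1, n_tiles + 1):
        if leaked + working_set_mib > capacity_mib:
            return tile
        # The tile completes, but part of its memory is never released.
        leaked += leak_per_tile_mib
    return None


# Placeholder values: 11 GiB card, 1.5 GiB working set, 1 GiB leaked per tile.
print(first_failing_tile(11 * 1024, 1536, 1024, n_tiles=16))
# Without a leak every tile fits.
print(first_failing_tile(11 * 1024, 1536, 0, n_tiles=16))
```

The point of the model is only that the failing tile depends on how much is leaked per tile, not on whether a single tile fits.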

See three snapshots of GPU usage at different time points below. Also the log of the execution is attached.

cheers

marcelo

out.txt

elgw commented 5 months ago

Thanks, that is useful to know.

elgw commented 5 months ago

Hello again,

Here is an update from my side:

I could confirm the issue on my machine. The "bug" has probably been there for a long time but only manifests itself when the GPU memory usage is close to the limit and when tiling is enabled, hence it went under the radar.
The issue is resolved as of version 0.4.1, which you can find in the dev branch for the time being. It will make it to the main branch as soon as I've run some more tests (but I don't expect any regressions in particular).

What was wrong?

There was one large OpenCL buffer that was not freed because the object owning it was freed with the wrong method.
There was a smaller buffer that was never released at all.

Preventing it in the future:

From v0.4.1, dw keeps track of the number of GPU allocations and releases and checks that they match after each tile is deconvolved.
There are of course more things that could be done to prevent silly mistakes like this, the future will tell what I find time to improve :)
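The idea of matching allocations against releases can be sketched as follows. This is a minimal illustration in Python, not dw's C/OpenCL code: the class, the method names and the bytearray standing in for a GPU buffer are all assumptions made for the example.

```python
class AllocTracker:
    """Counts allocations and releases so a mismatch shows up per tile."""

    def __init__(self):
        self.n_alloc = 0
        self.n_release = 0

    def alloc(self, nbytes):
        self.n_alloc += 1
        return bytearray(nbytes)  # stand-in for a GPU buffer

    def release(self, buf):
        self.n_release += 1

    def assert_balanced(self, where):
        if self.n_alloc != self.n_release:
            raise RuntimeError(
                f"{where}: {self.n_alloc} allocations but {self.n_release} releases"
            )


def deconvolve_tile(tracker, forget_release=False):
    """Hypothetical per-tile work that allocates two buffers."""
    work = tracker.alloc(1024)
    scratch = tracker.alloc(64)
    tracker.release(work)
    if not forget_release:
        tracker.release(scratch)  # the kind of release that is easy to miss


tracker = AllocTracker()
for tile in range(1, 5):
    deconvolve_tile(tracker)
    tracker.assert_balanced(f"tile {tile}")
```

Calling deconvolve_tile(tracker, forget_release=True) makes the next assert_balanced call raise, which is how a missed release would be caught right after the tile that caused it.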

Thank you so much for spotting the problem and caring to report!

elgw commented 5 months ago

It works with my AMD card but memory is still not released properly with an Nvidia card. Still investigating.

elgw commented 5 months ago

The memory release was blocked on Nvidia due to a missing clReleaseEvent. With that in place it looks good on both AMD and Nvidia.
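Why a forgotten event can pin memory is easiest to see in a toy model of reference counting. The sketch below is not OpenCL and not dw's code: it simply assumes that an event returned by an enqueue call keeps the buffers it touched alive until the event itself is released. Whether a real driver behaves this way is implementation dependent, which would be consistent with the leak showing up on Nvidia but not on AMD. All names are hypothetical.

```python
class ToyDevice:
    """Toy model of reference-counted device memory (not real OpenCL)."""

    def __init__(self):
        self.bytes_in_use = 0

    def create_buffer(self, nbytes):
        self.bytes_in_use += nbytes
        return {"nbytes": nbytes, "refs": 1}

    def _unref(self, buf):
        buf["refs"] -= 1
        if buf["refs"] == 0:
            self.bytes_in_use -= buf["nbytes"]

    def enqueue_copy(self, src, dst):
        # Assumption: the returned event holds a reference to both buffers.
        src["refs"] += 1
        dst["refs"] += 1
        return {"holds": (src, dst)}

    def release_buffer(self, buf):
        self._unref(buf)

    def release_event(self, event):
        for buf in event["holds"]:
            self._unref(buf)


def run_tile(dev, release_events):
    a = dev.create_buffer(100)
    b = dev.create_buffer(100)
    event = dev.enqueue_copy(a, b)
    if release_events:
        dev.release_event(event)
    dev.release_buffer(a)
    dev.release_buffer(b)


leaky, fixed = ToyDevice(), ToyDevice()
run_tile(leaky, release_events=False)
run_tile(fixed, release_events=True)
print(leaky.bytes_in_use, fixed.bytes_in_use)
```

In this model both buffers are released by the caller in either case, but without the event release their memory stays in use.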

Screenshots of nvtop while deconvolving 2048 x 2048 x P images using tiles of size 1024:

marcnol commented 5 months ago

I just tried version 0.4.1 on two NVIDIA GPUs and on two Linux distros (Ubuntu-like and Arch) and the bug does not appear anymore. Given the memory of my GPU (11 GB) and the size of my images (590 MB), I had to use a tilesize of 1024. There is no more memory leakage and dw properly deconvolves these images.

Thanks Erik for your prompt response!

elgw / deconwolf

Error while running with --gpu mode #62