HenriquesLab / NanoPyx

Nanoscopy library for Python (NanoPyx, the successor to NanoJ) - focused on light microscopy and super-resolution imaging
Creative Commons Attribution 4.0 International

help with selecting correct gpu device #115

Open simonecoppola opened 2 days ago

simonecoppola commented 2 days ago

Hi, I've been testing the nanopyx package for eSRRF, but I'm having some issues trying to make it run on my GPU (this usually runs fine in the ImageJ counterpart).

I followed the installation instructions (pip install nanopyx[all] and then pip install cupy-cuda12x). When I run the Jupyter notebook for eSRRF, I can see usage spike on the integrated graphics, but nothing on the GPU card.

Is there a command to select which GPU device to run the eSRRF command on?

Thanks for your help!

An update on this: I also tried completely disabling the integrated graphics in the Device Manager. When I do that, eSRRF runs on the CPU.

brunomsaraiva commented 2 days ago

Hi @simonecoppola,

Thanks for reaching out :) Could you please create a new cell, run the following and check whether you see the information regarding your GPU card:

from nanopyx import print_opencl_info
print_opencl_info()

Can you confirm that your dedicated GPU device appears there?

simonecoppola commented 2 days ago

Thanks for the quick response @brunomsaraiva - I can confirm the GPU card appears when I run that!

Here's the output:

============================================================
OpenCL Platforms and Devices
============================================================
Platform - Name: NVIDIA CUDA
Platform - Vendor: NVIDIA Corporation
Platform - Version: OpenCL 3.0 CUDA 12.4.131
Platform - Profile: FULL_PROFILE
    --------------------------------------------------------
    Device - Name: NVIDIA RTX A2000
    Device - Type: ALL | GPU
    Device - Max Clock Speed: 1200 Mhz
    Device - Compute Units: 26
    Device - Local Memory: 48 KB
    Device - Constant Memory: 64 KB
    Device - Global Memory: 6 GB
    Device - Max Buffer/Image Size: 1534 MB
    Device - Max Work Group Size: 1024
============================================================
Platform - Name: OpenCLOn12
Platform - Vendor: Microsoft
Platform - Version: OpenCL 3.0 D3D12 Implementation
Platform - Profile: FULL_PROFILE
    --------------------------------------------------------
    Device - Name: NVIDIA RTX A2000
    Device - Type: ALL | GPU
    Device - Max Clock Speed: 12 Mhz
    Device - Compute Units: 1
    Device - Local Memory: 32 KB
    Device - Constant Memory: 64 KB
    Device - Global Memory: 6 GB
    Device - Max Buffer/Image Size: 1024 MB
    Device - Max Work Group Size: 1024
    --------------------------------------------------------
    Device - Name: Microsoft Basic Render Driver
    Device - Type: ALL | CPU
    Device - Max Clock Speed: 12 Mhz
    Device - Compute Units: 1
    Device - Local Memory: 32 KB
    Device - Constant Memory: 64 KB
    Device - Global Memory: 32 GB
    Device - Max Buffer/Image Size: 1024 MB
    Device - Max Work Group Size: 1024

brunomsaraiva commented 2 days ago

Ok, so then it's our automatic selection of the best GPU device that is picking the wrong one. Could you please send us the output of:

from nanopyx.__opencl__ import _fastest_device
print(_fastest_device)

In the meantime, I'm looking through that part of our code base to see if I spot any bugs.

simonecoppola commented 2 days ago

The output of that function seems right:

{'device': <pyopencl.Device 'NVIDIA RTX A2000' on 'NVIDIA CUDA' at 0x1fe5fb3a470>, 'DP': False}

But it's definitely not running on the GPU - usage is at 0-1%!

brunomsaraiva commented 2 days ago

Ok, so then it's picking up the right device. Is there any error output when you try to run the Jupyter notebook? Could you try running the following code:

import numpy as np
from nanopyx.core.transform._le_esrrf import eSRRF
from nanopyx.__opencl__ import _fastest_device
esrrf = eSRRF()
img = np.random.random((10, 100, 100)).astype(np.float32)
esrrf._run_opencl(img, device=_fastest_device)

If it runs as expected on the NVIDIA GPU, can you try setting the dimensions of the random image to the same dimensions as the data you're currently trying to use? np.random.random((frames, height, width))

Thanks for the help in finding the cause of the issue :)

simonecoppola commented 1 day ago

Thank you for helping me troubleshoot this!

I ran the code snippet you gave me, and it ran fine for the (10, 100, 100) size, but when I ran it with the size of the image I was originally trying to process, it produced this error:

Error: Buffer size is larger than device maximum memory allocation size

Is it not possible to run eSRRF in nanopyx on large images? I was able to process the image fine when using eSRRF in ImageJ.
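As a rough sanity check on the numbers (assuming magnification 5, as in my runs, and the 1534 MB "Max Buffer/Image Size" from my print_opencl_info output above), the whole-stack output buffer would indeed be far larger than the per-allocation limit, even though the input stack itself fits:

```python
import numpy as np

# Back-of-the-envelope buffer sizes for the full stack, assuming 5x magnification.
frames, h, w, mag = 100, 1028, 1028, 5
bytes_f32 = np.float32().nbytes

input_mb = frames * h * w * bytes_f32 / 1024 ** 2
output_mb = frames * (h * mag) * (w * mag) * bytes_f32 / 1024 ** 2
max_alloc_mb = 1534  # "Max Buffer/Image Size" reported by print_opencl_info

print(round(input_mb), "MB input stack")    # fits in a single buffer
print(round(output_mb), "MB output stack")  # far exceeds the 1534 MB limit
print(output_mb > max_alloc_mb)
```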

I've also done some further testing and observed the following: if I run this code snippet:

import numpy as np
from nanopyx.methods import eSRRF
img = np.random.random((50, 100, 100)).astype(np.float32)
result = eSRRF(img, magnification=5, radius=1.5,
                   sensitivity=2,
                   doIntensityWeighting=True)

it takes a while for the code to complete, and at the end I get this output:

Agent: eSRRF_ST using unthreaded ran in 23.723267900059 seconds

whereas, if I add the argument _force_run_type='opencl', the code runs really fast and I get the following output, which looks like what I'm expecting:

Agent: eSRRF_ST using opencl ran in 0.2890083999373019 seconds

brunomsaraiva commented 1 day ago

It should be possible to run exactly the same images in NanoJ and NanoPyx, so we might have a bug somewhere creating these issues. As of now I see two possibilities:

  1. there is an issue with pyopencl allocating memory on your GPU, causing it to fail
  2. something on our side is not selecting the right device/implementation

To help find the root of the issue, could you please tell us what image dimensions you're currently trying to use? Also, I created a new version of the notebook here. This version forces the usage of opencl and the _fastest_device (as in the script I sent you) - could you try to run it and tell us if you're getting the correct output? And in case it fails, could you try a cropped version of your image?

Thanks, Bruno

simonecoppola commented 1 day ago

The image size is 100x1028x1028 px.

I tried running it in the notebook you linked and it doesn't seem to work - when I press the run button it switches to "Running..." but after several tens of minutes it is still not complete, and I can see my CPU and GPU usage at 1-2%, so perhaps it fails silently?

I tried running it with a cropped version of the image (50x168x168) and that ran as expected, so no problems there.

brunomsaraiva commented 1 day ago

If you run it for a single frame of 1028x1028, does it work?

It's a bit perplexing, because I can see from your print_opencl_info output that you have 6 GB of video memory, and even the full 100x1028x1028 px stack at 10x magnification (which I assume is higher than what you're currently using) shouldn't need more than 4 GB of memory. And even if it didn't fit, eSRRF is prepared to split the frames into chunks so that they fit inside GPU memory (as long as a single frame fits).
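For reference, the chunking idea can be sketched like this (a rough illustration of the arithmetic, not our actual implementation; 1534 MB is the "Max Buffer/Image Size" your print_opencl_info reported, and I'm assuming 5x magnification):

```python
import numpy as np

# Rough sketch: how many upscaled 1028x1028 frames fit in one OpenCL buffer.
h, w, mag = 1028, 1028, 5
bytes_per_frame = (h * mag) * (w * mag) * np.float32().nbytes

max_alloc = 1534 * 1024 ** 2  # per-buffer limit your device reports, in bytes
frames_per_chunk = max_alloc // bytes_per_frame

print(round(bytes_per_frame / 1024 ** 2), "MB per upscaled frame")
print(frames_per_chunk, "frames per chunk")
```

As long as frames_per_chunk is at least 1 (i.e. a single upscaled frame fits in one buffer), the stack can be processed in pieces.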

Are you using the most recent version of your GPU drivers?

In case a single frame works, I would suggest this workaround until we can pinpoint what's causing the issue:

import numpy as np
from tifffile import imread, imwrite
from nanopyx.methods import eSRRF
from nanopyx.core.transform.sr_temporal_correlations import calculate_eSRRF_temporal_correlations

magnification = 5 # current default value but feel free to change it to your needs
esrrf_order = "AVG" # default value but other options are "VAR" and "TAC2"
frames_per_timepoint = 100 # change to preferred

input_image = imread("input_image_path")
input_shape = input_image.shape
output = []

for i in range(input_shape[0] // frames_per_timepoint):
    block = input_image[i*frames_per_timepoint:(i+1)*frames_per_timepoint]
    esrrf_frames = np.zeros((block.shape[0], input_shape[1]*magnification, input_shape[2]*magnification)).astype(np.float32)

    for frame in range(block.shape[0]):
        # run eSRRF on each frame of the current block
        esrrf_frames[frame] = eSRRF(block[frame], magnification=magnification, _force_run_type='opencl')[0]
    output.append(calculate_eSRRF_temporal_correlations(esrrf_frames, esrrf_order))

output = np.array(output)
imwrite("path_to_output.tif", output)

In case you prefer this in notebook form, I'll happily provide you with a modified version of the notebook that performs this frame-by-frame calculation.

Best, Bruno

simonecoppola commented 23 hours ago

Yes, the drivers are the latest available!

I was able to run it on a single frame, and the workaround you suggested works fine!

Thanks for your help with this :)

simonecoppola commented 22 hours ago

Also related to this issue: I tried setting up nanopyx on another one of our lab's workstations, to see if there was anything specific to that GPU that wasn't playing nice. I ran:

from nanopyx import print_opencl_info
from nanopyx.__opencl__ import _fastest_device
print(print_opencl_info())
print(_fastest_device)

and the output is the following:

============================================================
OpenCL Platforms and Devices 
============================================================
Platform - Name:  NVIDIA CUDA
Platform - Vendor:  NVIDIA Corporation
Platform - Version:  OpenCL 3.0 CUDA 12.6.65
Platform - Profile:  FULL_PROFILE
    --------------------------------------------------------
    Device - Name: NVIDIA GeForce RTX 3060
    Device - Type: ALL | GPU
    Device - Max Clock Speed:  1777 Mhz
    Device - Compute Units:  28
    Device - Local Memory:  48 KB
    Device - Constant Memory:  64 KB
    Device - Global Memory: 12 GB
    Device - Max Buffer/Image Size: 3072 MB
    Device - Max Work Group Size: 1024
============================================================
Platform - Name:  Intel(R) OpenCL Graphics
Platform - Vendor:  Intel(R) Corporation
Platform - Version:  OpenCL 3.0 
Platform - Profile:  FULL_PROFILE
    --------------------------------------------------------
    Device - Name: Intel(R) UHD Graphics 770
    Device - Type: ALL | GPU
    Device - Max Clock Speed:  1650 Mhz
    Device - Compute Units:  32
    Device - Local Memory:  64 KB
    Device - Constant Memory:  4194296 KB
    Device - Global Memory: 51 GB
    Device - Max Buffer/Image Size: 4096 MB
    Device - Max Work Group Size: 512

{'device': <pyopencl.Device 'Intel(R) UHD Graphics 770' on 'Intel(R) OpenCL Graphics' at 0x17968b0ad00>, 'DP': False}

I was able to deal with this by disabling the integrated graphics completely in the Device Manager, so that running the script now gives this:

============================================================
OpenCL Platforms and Devices 
============================================================
Platform - Name:  NVIDIA CUDA
Platform - Vendor:  NVIDIA Corporation
Platform - Version:  OpenCL 3.0 CUDA 12.6.65
Platform - Profile:  FULL_PROFILE
    --------------------------------------------------------
    Device - Name: NVIDIA GeForce RTX 3060
    Device - Type: ALL | GPU
    Device - Max Clock Speed:  1777 Mhz
    Device - Compute Units:  28
    Device - Local Memory:  48 KB
    Device - Constant Memory:  64 KB
    Device - Global Memory: 12 GB
    Device - Max Buffer/Image Size: 3072 MB
    Device - Max Work Group Size: 1024

{'device': <pyopencl.Device 'NVIDIA GeForce RTX 3060' on 'NVIDIA CUDA' at 0x1f14ef4dde0>, 'DP': False}

But then if I try to run it like I did on the previous computer, with the "large" image size:

import numpy as np
from nanopyx.core.transform._le_esrrf import eSRRF
from nanopyx.__opencl__ import _fastest_device
esrrf = eSRRF()
img = np.random.random((10, 1028, 1028)).astype(np.float32)
esrrf._run_opencl(img, device=_fastest_device)

I get the buffer allocation size error:

Error: Buffer size is larger than device maximum memory allocation size

brunomsaraiva commented 20 hours ago

Hi @simonecoppola, thanks for trying it on a different computer - it was very helpful, as I just realized that calling eSRRF through opencl using ._run_opencl actually bypasses our memory management, creating those buffer size errors. We will fix that in a future NanoPyx release (likely in the next couple of weeks).

Now the good news: the problem on the new computer should be a different one from the initial one, as on the new computer _fastest_device actually has the wrong device. For that, I've updated our estimation of the fastest device and released a new version of the liquid_engine package that assigns a penalty to Intel integrated cards. You can update to the newest one by running:

pip install --upgrade liquid_engine==0.1.9
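The idea is roughly this (a simplified sketch of the heuristic, not the actual liquid_engine code; the device numbers are taken from your output above):

```python
# Simplified sketch of a device-ranking heuristic: score devices by a raw
# throughput proxy (compute units x clock), then penalise integrated graphics
# so a discrete card wins even when the integrated one reports more units.
devices = [
    {"name": "NVIDIA GeForce RTX 3060", "compute_units": 28, "clock_mhz": 1777},
    {"name": "Intel(R) UHD Graphics 770", "compute_units": 32, "clock_mhz": 1650},
]

def score(device, integrated_penalty=0.5):
    s = device["compute_units"] * device["clock_mhz"]
    if "Intel" in device["name"]:  # crude check for an integrated card
        s *= integrated_penalty
    return s

fastest = max(devices, key=score)
print(fastest["name"])
```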

The bad news is that for the first computer I'm still puzzled by the issue, because _fastest_device is actually grabbing the desired card but then bypassing it when it's actually time to run. I've been trying to reproduce it on our Windows machines with NVIDIA cards, but without success so far.

I'll ping you here if I find something else :)