CellProfiler / CellProfiler-plugins

Community-contributed and experimental CellProfiler modules.
http://plugins.cellprofiler.org/

TypeError in RunCellpose in GPU mode #154

Open braymp opened 2 years ago

braymp commented 2 years ago

Running the latest version of RunCellpose (commit 2b3d1332e4b48fcdc12ee314421bcde5b601556a) with the attached pipeline on the welcome-screen example images yields the following error:

Traceback (most recent call last):
  File "C:\Users\brayma2\OneDrive - Novartis Pharma AG\github\cellprofiler-plugins\runcellpose.py", line 373, in run
    y_data, flows, *_ = model.eval(
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\models.py", line 286, in eval
    masks, flows, styles = self.cp.eval(x,
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\models.py", line 629, in eval
    masks, styles, dP, cellprob, p, bd, tr = self._run_cp(x,
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\models.py", line 695, in _run_cp
    yf, style = self._run_nets(img, net_avg=net_avg,
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\core.py", line 405, in _run_nets
    y0, style = self._run_net(img, augment=augment, tile=tile,
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\core.py", line 479, in _run_net
    y, style = self._run_tiled(imgs, augment=augment, bsize=bsize,
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\core.py", line 580, in _run_tiled
    y0, style = self.network(IMG[irange], return_conv=return_conv)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\core.py", line 347, in network
    y, style = self.net(X)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\resnet_torch.py", line 200, in forward
    T0 = self.upsample(style, T0, self.mkldnn)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\resnet_torch.py", line 167, in forward
    x = self.up[n](x, xd[n], style, mkldnn=mkldnn)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellpose\resnet_torch.py", line 118, in forward
    x = self.proj(x) + self.conv[1](style, self.conv[0](x) + y, mkldnn=mkldnn)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
    input = module(input)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\torch\nn\modules\conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 4.00 GiB total capacity; 317.22 MiB already allocated; 2.01 GiB free; 409.60 MiB allowed; 360.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm less concerned about the above error, which seems out of scope here (although if you have suggestions on how to fix it, I'm all ears). It's the error that follows that I wanted to bring up:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\brayma2\onedrive - novartis pharma ag\github\cellprofiler\master\cellprofiler\gui\pipelinecontroller.py", line 3390, in do_step
    self.__pipeline.run_module(module, workspace_model)
  File "C:\Users\brayma2\.conda\envs\py38\lib\site-packages\cellprofiler_core\pipeline\_pipeline.py", line 1298, in run_module
    module.run(workspace)
  File "C:\Users\brayma2\OneDrive - Novartis Pharma AG\github\cellprofiler-plugins\runcellpose.py", line 386, in run
    except float(cellpose_ver[0:3]) >= 0.7 and int(cellpose_ver[0])<2:
TypeError: catching classes that do not inherit from BaseException is not allowed
kochild commented 2 years ago

Hi @braymp

I'm not a developer on this but I spent some time troubleshooting this because I was suffering from the same issue.

Because of how the script is coded, it has no handler for a CUDA out-of-memory error. That OOM is the real exception, but the except clause isn't naming an exception class at all: it's evaluating a version comparison (hence the except float(...) in the traceback) to decide whether to fall back to omnipose/older cellpose. Python refuses to catch anything that isn't a BaseException subclass, so the whole module crashes there with the TypeError instead of handling the original error.
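To make it concrete, the version test has to live in an if, not in the except clause. Something roughly like this (a sketch only, not the plugin's actual code; the function and argument names are just placeholders):

import importlib.metadata


def run_cellpose_eval(model, x, **eval_kwargs):
    """Call model.eval() without hiding the real error behind a version check."""
    cellpose_ver = importlib.metadata.version("cellpose")   # e.g. "0.7.2" or "2.1.0"
    old_api = float(cellpose_ver[0:3]) >= 0.7 and int(cellpose_ver[0]) < 2
    try:
        result = model.eval(x, **eval_kwargs)
    except RuntimeError:
        # CUDA out-of-memory surfaces here as a RuntimeError; handle it or re-raise it,
        # but never put a version comparison where Python expects an exception class.
        raise
    if old_api:
        # any pre-2.0 cellpose / omnipose specific unpacking would go here
        pass
    y_data, flows, *_ = result
    return y_data, flows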

To fix this without any changes to the code, you'll have to reduce the number of workers and reduce the GPU memory share per worker. On my machine, each worker seemed to need ~1.4 GB of GPU memory. Based on that and your 4 GB of GPU memory, I think you will probably only have room for two workers, each set to use 0.5 of the GPU memory share. You might be better off just using the CPU in your case if that's what it comes down to.

I can't really think of a clean way to fix multiple workers contending for the same limited GPU memory, but if you want it to not crash with more than 2 workers, you'll have to modify the except statement to handle this error. I did this by wrapping the call in a retry loop with its own except clause so the module keeps trying until it finishes (changes to lines 372-407). I had to drop the ability to use omnipose, but I don't use it, so that doesn't really affect me. You can check out my fork of the script here https://github.com/kochild/CellProfiler-plugins/blob/master/runcellpose.py if you want to try it. My hack eventually staggers the workers, so once the first workers complete the module the rest produce their masks at a reasonable speed (~5 seconds).
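The gist of the retry is roughly this (a simplified sketch of the idea, not the exact code in my fork):

import time

import torch


def eval_until_done(model, x, max_attempts=20, wait_seconds=5, **eval_kwargs):
    """Keep retrying model.eval() while other workers are holding the GPU memory."""
    for attempt in range(max_attempts):
        try:
            return model.eval(x, **eval_kwargs)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise                       # a genuine error, not an OOM collision
            torch.cuda.empty_cache()        # hand back this worker's cached blocks
            time.sleep(wait_seconds)        # let the other workers finish their turn
    raise RuntimeError(f"GPU still out of memory after {max_attempts} attempts")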

I haven't done any exhaustive benchmarking with this, but so far I think it outperforms 10 workers on our Intel i9 3.7 GHz in CPU-only mode. I can also use 5 workers instead of just 3 with my GPU, so there is a boost there.

JulieNec commented 2 years ago

I have had the same problem. I tried using the fork above (thanks @kochild!), but it did not work for me (it got stuck at FINDING MASKS forever and I had to force-close the program). That being said, I am sure there is more tweaking I can do on my end to make sure it matches up with your workaround. And thanks @braymp for bringing this thread to my attention!

kochild commented 2 years ago

Hey @JulieNec

Oh yeah, the hack isn't perfect, but it works great with the 5 GB GPU we have. A nicer solution might be to have the workers know they have to wait for the RunCellpose module to finish in one worker before executing it in another on systems with limited GPU memory. But that's a more complex fix and would require workers to know what the other workers are doing, and I can't put in the effort to figure that out/learn it at the moment.

I think in your case you may want to try playing with the number of workers / GPU allocation. I found that on our Linux workstation with a Quadro P620, a GPU with 2 GB of memory (nvidia-smi showing 375 MiB / 1991 MiB used before even launching CellProfiler), it runs for me if I set CellProfiler to use 2 workers in the preferences and set the GPU memory share for each worker to 1 in the module.
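For what it's worth, my understanding (an assumption on my part, I haven't traced it through the plugin source) is that the GPU memory share setting ends up as PyTorch's per-process cap, roughly:

import torch

if torch.cuda.is_available():
    # share = the value entered in the module; 1.0 lets one worker use the whole card,
    # 0.5 splits it two ways, and so on. The "MiB allowed" figure in the OOM message
    # is this cap, which is why a small share can fail even when the card looks free.
    torch.cuda.set_per_process_memory_fraction(1.0, device=0)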

Also, try restarting CellProfiler if a CUDA error occurs. The nvidia-smi tool revealed to me that Python will hold/allocate GPU memory for the number of workers you previously specified, even after you reduce the number of workers from 3 to 2. That's memory you can't get back until you restart CellProfiler. See below:

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:01:00.0  On |                  N/A |
| 43%   59C    P0    N/A /  N/A |   1902MiB /  1991MiB |     28%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8829      C   ...nda3/envs/cp42/bin/python      463MiB |
|    0   N/A  N/A      8987      G   /usr/bin/X                        173MiB |
|    0   N/A  N/A      9958      G   /usr/bin/gnome-shell               16MiB |
|    0   N/A  N/A     13815      C   ...nda3/envs/cp42/bin/python      565MiB |
|    0   N/A  N/A     13817      C   ...nda3/envs/cp42/bin/python      561MiB |
|    0   N/A  N/A     21720      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A     26483      G   /usr/bin/X                         91MiB |
+-----------------------------------------------------------------------------+
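As a side note, if you want to see what a worker is holding from inside Python rather than from nvidia-smi, torch exposes the allocator's counters directly. empty_cache() only releases blocks cached by the current process, so memory pinned by an idle worker still needs a CellProfiler restart:

import torch

print(torch.cuda.memory_allocated() / 2**20, "MiB in live tensors")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the caching allocator")
torch.cuda.empty_cache()  # frees reserved-but-unused blocks in this process only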

Running my pipelines on this 2 GB GPU with 2 workers is way faster than CPU-only with 6 workers; in fact it is 2.5x faster. But there is a chance a CUDA error will stall the analysis at times, so you have to pick your battles. If you can run the pipeline overnight on the CPU, that's probably the safest/easiest option.

Also, I will say that this plugin might work best on Windows, which has a shared GPU memory pool. I'm not sure what changed since the last time I troubleshot this, but our Windows workstation can now properly use the computer's shared GPU memory pool, giving us 20.9 GB to work with vs. the card's 5.0 GB. There's actually no longer an issue running 10 workers on the original Python code. Weird.

JulieNec commented 2 years ago

Hey @kochild,

Thank you for the tips! I tried reducing the number of workers but forgot to change the GPU allocation in the module itself (🤦‍♀️). Using my 4 GB Quadro P1000 (using 400 MiB / 4096 MiB before running CP), if I allocate a GPU memory share of 1.0 per worker, I can run with 3 workers without an error. I have pretty large files, so maybe that is limiting me as well. I don't currently have access to a workstation with a shared GPU memory pool but will see if I can get access to one. But I'm glad you can run yours now with lots of GPU and no errors!

Thanks for all the help, Julie

fefossa commented 1 year ago

We appreciate all the discussion about RunCellpose. We hope that with the newer version of RunCellpose with Docker support, most of these problems will be solved (new documentation here). If any questions remain, please feel free to open a new issue! @ErinWeisbart