RuntimeError: CUDA out of memory at cell_seg

Hi, I encountered a runtime error while running cell_seg. Here are my code and outputs:

import os,sys
import pandas as pd
import numpy as np
from natsort import natsorted
import stereo as st
from stereo import image as im
import torch
from stereo.core.ms_data import MSData
from stereo.core.ms_pipeline import slice_generator
import warnings
warnings.filterwarnings('ignore')

os.environ["CUDA_VISIBLE_DEVICES"]="0" 
print('\nsys version is {}'.format(sys.version))
print(torch.cuda.is_available() )

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
%config InlineBackend.figure_format = 'retina'
print("stereo version is {}".format(st.__version__))
print("torch version is {}, device is {}\n".format(torch.__version__, device))

## output 
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

sys version is 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:46:39) 
[GCC 10.4.0]
True
stereo version is 1.1.0
torch version is 1.10.0, device is cuda:0

wkdir = 'datasets'
srcdir =  os.path.join(wkdir, 'sawouts/P3_noImage')
output =  os.path.join(wkdir, 'sawouts/P3_noImage/041.cellcut.stereopy')

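## Here the image produced by the old version of SAW is used directly as input; the spatial expression matrix (GEF) produced by the new version of SAW is used later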
image = os.path.join(wkdir, '01.StandardWorkflow_Result/Register/C01932C2_regist.tif')  #'./SS200000135TL_D1_regist.tif'

model = 'models/Cellpose/Deep_Learning_Model_v03/cell_segmetation_v3.0.onnx' 
tissue_seg_model = 'models/Deep_Learning_Model/weight_tissue_cut_tool_220304.hdf5'
#Deep Learning Model V1
model_v1 = 'models/Cellpose/Deep_Learning_Model_v01/cell_segmetation_v1.0.pth'
output_v1 =  os.path.join(wkdir, 'sawouts/P3_noImage/041.cellcut.stereopy_v1')

im.cell_seg(
    model_path=model_v1,
    img_path=image,
    out_path=output_v1,
    tissue_seg_model_path=tissue_seg_model,
    tissue_seg_method=1,
    num_threads=10,
    gpu=0,
    method="v1"
    )

Here is the running output of im.cell_seg:

[2024-03-04 08:05:04][Stereo][79306][MainThread][22854609426240][cell_seg_pipeline][128][INFO]: C01932C2_regist.tif transfer to 8bit
[2024-03-04 08:05:10][Stereo][79306][MainThread][22854609426240][cell_seg_pipeline][62][INFO]: Transform 16bit to 8bit : 6.54
[2024-03-04 08:05:10][Stereo][79306][MainThread][22854609426240][pipeline][64][INFO]: source image type: ssdna
[2024-03-04 08:05:10][Stereo][79306][MainThread][22854609426240][pipeline][65][INFO]: segmentation method: deep learning
[2024-03-04 08:05:10][Stereo][79306][MainThread][22854609426240][pipeline][266][INFO]: tissueCut_model infer...
[2024-03-04 08:05:10][Stereo][79306][MainThread][22854609426240][pipeline][168][INFO]: image loading and preprocessing...
2024-03-04 08:05:11.179577: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-04 08:05:13.931939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30439 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:25:00.0, compute capability: 7.0
2024-03-04 08:05:35.270228: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 7605
2024-03-04 08:05:35.271985: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-03-04 08:05:35.274326: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-03-04 08:05:35.274387: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2024-03-04 08:05:35.275667: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-03-04 08:05:35.275703: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
[2024-03-04 08:05:54][Stereo][79306][MainThread][22854609426240][pipeline][164][INFO]: seg results saved in /home/datasets/sawouts/P3_noImage/041.cellcut.stereopy_v1/C01932C2_regist_tissue_cut.tif
【image 1/1】
  0%|                                                   | 0/136 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 5
      2 model_v1 = 'models/Cellpose/Deep_Learning_Model_v01/cell_segmetation_v1.0.pth'
      3 output_v1 =  os.path.join(wkdir, 'sawouts/P3_noImage/041.cellcut.stereopy_v1')
----> 5 im.cell_seg(
      6     model_path=model_v1,
      7     img_path=image,
      8     out_path=output_v1,
      9     tissue_seg_model_path=tissue_seg_model,
     10     tissue_seg_method=1,
     11     num_threads=10, gpu=0,
     12     method="v1"
     13     )

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/segment.py:82, in cell_seg(model_path, img_path, out_path, deep_crop_size, overlap, gpu, tissue_seg_model_path, tissue_seg_method, post_processing_workers, is_water, num_threads, need_tissue_cut, method)
     68 if method == VersionType.v1.value:
     69     cell_seg_pipeline = CellSegPipeV1(
     70         model_path,
     71         img_path,
   (...)
     80         post_processing_workers=post_processing_workers,
     81     )
---> 82     cell_seg_pipeline.run()
     83 elif method == VersionType.v3.value:
     84     cell_seg_pipeline = CellSegPipeV3(
     85         model_path,
     86         img_path,
   (...)
     95         post_processing_workers=post_processing_workers,
     96     )

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/cell_seg_pipeline_v1.py:115, in CellSegPipeV1.run(self)
    113 t0 = time.time()
    114 # cell segmentation in roi
--> 115 tissue_cell_label = self.tissue_cell_infer()
    116 t1 = time.time()
    117 logger.info('Cell inference : %.2f' % (t1 - t0))

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/cell_seg_pipeline_v1.py:75, in CellSegPipeV1.tissue_cell_infer(self, q)
     73 for img, tissue_bbox in zip(self.img_filter, self.tissue_bbox):
     74     tissue_img = [img[p[0]: p[2], p[1]: p[3]] for p in tissue_bbox]
---> 75     label_list = cell_infer.cellInfer(self.model_path, tissue_img, self.deep_crop_size, self.overlap)
     76     tissue_cell_label.append(label_list)
     77 if q is not None:

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/cell_infer.py:376, in cellInfer(model_path, file, size, overlap)
    374 img = batch
    375 img = img.to(device, dtype=torch.float)
--> 376 pred_mask = model(img)
    377 pred_mask = torch.sigmoid(pred_mask).detach().cpu().numpy()
    378 pred = pred_mask[:, 0, :, :]

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/resnet_unet.py:219, in EpsaResUnet.forward(self, x)
    216 def forward(self, x):
    218     block1 = self.block1(x)
--> 219     block2 = self.block2(block1)
    220     block3 = self.block3(block2)
    221     block4 = self.block4(block3)

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/container.py:141, in Sequential.forward(self, input)
    139 def forward(self, input):
    140     for module in self:
--> 141         input = module(input)
    142     return input

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/container.py:141, in Sequential.forward(self, input)
    139 def forward(self, input):
    140     for module in self:
--> 141         input = module(input)
    142     return input

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/models/epsanet.py:123, in EPSABlock.forward(self, x)
    120 out = self.bn1(out)
    121 out = self.relu(out)
--> 123 out = self.conv2(out)
    124 out = self.bn2(out)
    125 out = self.relu(out)

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/stereo/image/segmentation/seg_utils/v1/models/epsanet.py:65, in PSAModule.forward(self, x)
     63 x2 = self.conv_2(x)
     64 x3 = self.conv_3(x)
---> 65 x4 = self.conv_4(x)
     67 feats = torch.cat((x1, x2, x3, x4), dim=1)
     68 feats = feats.view(batch_size, 4, self.split_channel, feats.shape[2], feats.shape[3])

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/conv.py:446, in Conv2d.forward(self, input)
    445 def forward(self, input: Tensor) -> Tensor:
--> 446     return self._conv_forward(input, self.weight, self.bias)

File ~/miniforge3/envs/stereopy1.1/lib/python3.8/site-packages/torch/nn/modules/conv.py:442, in Conv2d._conv_forward(self, input, weight, bias)
    438 if self.padding_mode != 'zeros':
    439     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    440                     weight, bias, self.stride,
    441                     _pair(0), self.dilation, self.groups)
--> 442 return F.conv2d(input, weight, bias, self.stride,
    443                 self.padding, self.dilation, self.groups)

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 611.97 MiB already allocated; 19.69 MiB free; 620.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
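
To put the numbers in that message side by side, here is a minimal sketch in plain Python (no GPU needed). The helper name `parse_oom` is hypothetical and not part of PyTorch or Stereopy; it only converts the sizes quoted in the message to MiB and subtracts what PyTorch itself reserved and what is still free, which gives a rough figure for memory held outside PyTorch's allocator.

```python
import re

# The relevant part of the error message above, copied verbatim.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; "
    "611.97 MiB already allocated; 19.69 MiB free; 620.00 MiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in a PyTorch OOM message, converted to MiB.

    Hypothetical helper for illustration only.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (\wiB)",
        "total": r"([\d.]+) (\wiB) total capacity",
        "allocated": r"([\d.]+) (\wiB) already allocated",
        "free": r"([\d.]+) (\wiB) free",
        "reserved": r"([\d.]+) (\wiB) reserved",
    }
    sizes = {}
    for key, pattern in patterns.items():
        value, unit = re.search(pattern, message).groups()
        sizes[key] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory that is neither free nor reserved by PyTorch in this process.
held_elsewhere = sizes["total"] - sizes["reserved"] - sizes["free"]
print(f"held outside PyTorch's allocator: {held_elsewhere:.2f} MiB")
```

With the figures from this traceback the result is about 31 GiB, so PyTorch's own 620 MiB reservation is a small fraction of what is in use on the card.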

Here are my GPU info and conda environment:

$ nvidia-smi 
Mon Mar  4 08:06:13 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   39C    P0    37W / 250W |  32491MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

$ gpustat 
n1                        Mon Mar  4 08:06:16 2024  515.86.01
[0] Tesla V100S-PCIE-32GB | 39'C,   0 % | 32748 / 32768 MB |

$  mamba list | egrep 'torch|tensor|cuda'
cudatoolkit               10.2.89             h713d32c_10    conda-forge
pytorch                   1.10.0          cuda102py38h17946ce_1    conda-forge
tensorboard               2.6.0              pyhd8ed1ab_1    conda-forge
tensorboard-data-server   0.6.1            py38h2b5fc30_4    conda-forge
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
tensorflow                2.7.0           cuda102py38h32e99bf_0    conda-forge
tensorflow-base           2.7.0           cuda102py38h021f141_0    conda-forge
tensorflow-estimator      2.7.0           cuda102py38h4357c17_0    conda-forge
torchvision               0.11.1               py38_cu102    pytorch

My GPU is an NVIDIA V100S with 32 GB of memory. Could you please help me figure this out?
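
One thing that may be worth testing, offered only as an untested sketch: the TensorFlow log above reports `Created device ... with 30439 MB memory`, so it is possible (an assumption, not confirmed) that the tissue segmentation step reserves most of the card before the PyTorch model runs. TensorFlow documents the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable for allocating GPU memory on demand; whether Stereopy's tissue segmentation honours it is also an assumption. The helper name `request_tf_memory_growth` is hypothetical.

```python
import os


def request_tf_memory_growth(env=os.environ):
    """Ask TensorFlow to allocate GPU memory on demand instead of up front.

    Hypothetical helper; it only sets TensorFlow's documented
    TF_FORCE_GPU_ALLOW_GROWTH variable and returns the value that was set.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


# This has to run before TensorFlow is first imported, so in a notebook it
# would go in a fresh kernel, ahead of `import stereo` and `im.cell_seg(...)`.
request_tf_memory_growth()
```

If memory usage reported by nvidia-smi stays high even with this set, the cause is likely elsewhere and this sketch does not apply.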

STOmics / Stereopy

RuntimeError: CUDA out of memory at cell_seg #246