Project-MONAI / research-contributions

Implementations of recent research prototypes/demonstrations using MONAI.
https://monai.io/
Apache License 2.0

Swin UNETR: 'RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR' #153

Closed. naayem closed this issue 2 years ago.

naayem commented 2 years ago

Hello, I git cloned the repository onto my server and created a conda environment with python=3.7.7. In the directory /home/naayem/Projects/swin/research-contributions/SwinUNETR/BTCV I installed the requirements into the environment with pip install -r requirements.txt. I then ran the command:

python main.py --json_list=/scratch/izar/naayem/ct_scans/data/nnUNet_raw_data_base/nnUNet_raw_data/Task501_WORD-V0.1.0/dataset_test.json --data_dir=/scratch/izar/naayem/ct_scans/data/nnUNet_raw_data_base/nnUNet_raw_data/Task501_WORD-V0.1.0 --feature_size=48 --use_ssl_pretrained --roi_x=96 --roi_y=96 --roi_z=96 --use_checkpoint --batch_size=1 --max_epochs=1000 --save_checkpoint

and I get the error below. Am I the only one getting this kind of error? I didn't find any similar error in the existing issues, and I am perplexed by what looks to me like a difficult debugging task.

(testswin) python main.py --json_list=/scratch/izar/naayem/ct_scans/data/nnUNet_raw_data_base/nnUNet_raw_data/Task501_WORD-V0.1.0/dataset_test.json --data_dir=/scratch/izar/naayem/ct_scans/data/nnUNet_raw_data_base/nnUNet_raw_data/Task501_WORD-V0.1.0 --feature_size=48 --use_ssl_pretrained --roi_x=96 --roi_y=96 --roi_z=96 --use_checkpoint --batch_size=1 --max_epochs=1000 --save_checkpoint
Loading dataset: 100%|████████████████████████████████████████████████████| 24/24 [00:20<00:00,  1.18it/s]
0  gpu 0                                                                                                  
Batch size is: 1 epochs 1000                                                                              
Tag 'module.' found in state dict - fixing!                                                               
Using pretrained self-supervised Swin UNETR backbone weights !                                            
/home/naayem/miniconda3/envs/testswin/lib/python3.7/site-packages/monai/transforms/post/array.py:176: UserWarning: `to_onehot=True/False` is deprecated, please use `to_onehot=num_classes` instead.
  warnings.warn("`to_onehot=True/False` is deprecated, please use `to_onehot=num_classes` instead.")
Total parameters count 62187296                                                                           
Writing Tensorboard logs to  ./runs/test                                                                  
0 Tue Nov 29 17:16:21 2022 Epoch: 0      
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [28,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [29,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [30,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [92,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [93,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [94,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [95,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [124,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [125,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [126,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [348,0,0], thread: [127,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [508,0,0], thread: [124,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [508,0,0], thread: [125,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [508,0,0], thread: [126,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [508,0,0], thread: [127,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [364,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1142,0,0], thread: [84,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1142,0,0], thread: [85,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [1142,0,0], thread: [86,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [92,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [93,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [94,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [95,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [382,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [28,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [29,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [30,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [456,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [382,0,0], thread: [98,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [382,0,0], thread: [99,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [382,0,0], thread: [100,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 251, in <module>
    main()
  File "main.py", line 105, in main
    main_worker(gpu=0, args=args)
  File "main.py", line 245, in main_worker
    post_pred=post_pred,
  File "/home/naayem/Projects/swin/research-contributions/SwinUNETR/BTCV/trainer.py", line 160, in run_tra
ining
    model, train_loader, optimizer, scaler=scaler, epoch=epoch, loss_func=loss_func, args=args
  File "/home/naayem/Projects/swin/research-contributions/SwinUNETR/BTCV/trainer.py", line 44, in train_ep
och
    scaler.scale(loss).backward()
  File "/home/naayem/miniconda3/envs/testswin/lib/python3.7/site-packages/torch/_tensor.py", line 488, in 
backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/home/naayem/miniconda3/envs/testswin/lib/python3.7/site-packages/torch/autograd/__init__.py", lin
e 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, p
lease include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 48, 96, 96, 96], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(48, 14, kernel_size=[1, 1, 1], padding=[0, 0, 0], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    memory_format = Contiguous
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 1]
    dilation = [1, 1, 1]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x2b87c4038000
    type = CUDNN_DATA_HALF 
    nbDims = 5
    dimA = 4, 48, 96, 96, 96, 
    strideA = 42467328, 884736, 9216, 96, 1, 
output: TensorDescriptor 0x2b87c40380b0
    type = CUDNN_DATA_HALF 
    nbDims = 5
    dimA = 4, 14, 96, 96, 96, 
    strideA = 12386304, 884736, 9216, 96, 1, 
weight: FilterDescriptor 0x2b87c4038070
    type = CUDNN_DATA_HALF 
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 5
    dimA = 14, 48, 1, 1, 1, 
Pointer addresses: 
    input: 0x2b89c3e00000
    output: 0x2b8994600000 
    weight: 0x2b88d3ffda00 

Environment (please complete the following information):

My system: Python 3.7.7 (default, Sep 22 2022, 13:53:33) [GCC 9.3.0] on linux

NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
tangy5 commented 2 years ago

Hi @naayem, thanks for reporting the issue. The cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is a generic error which might be related to the CUDA or cuDNN installation. I see you used your own data here; did you try the tutorial (BTCV) dataset with the default parameters? You can use BTCV to test whether it is a CUDA or cuDNN problem.
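For what it's worth, a minimal standalone sanity check (an editor-added sketch, not part of the repository; sizes and shapes are illustrative) can help separate the two cases: if a small half-precision Conv3d forward/backward runs cleanly, CUDA and cuDNN are probably fine, and the device-side "index out of bounds" asserts earlier in the log become the more likely root cause. CUDA_LAUNCH_BLOCKING is set only so the first failing kernel is the one reported instead of a later, misleading error.

# Hedged debugging sketch (assumed setup, not the repository's code):
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA is initialized; makes kernel
                                          # launches synchronous so the real failing op is reported

import torch

# Tiny Conv3d forward/backward in half precision, similar in spirit to the repro snippet above.
x = torch.randn(1, 48, 32, 32, 32, device="cuda", dtype=torch.half, requires_grad=True)
conv = torch.nn.Conv3d(48, 14, kernel_size=1).cuda().half()
conv(x).sum().backward()
torch.cuda.synchronize()

print("cuDNN version:", torch.backends.cudnn.version())
print("Conv3d forward/backward completed, so CUDA/cuDNN look healthy.")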

naayem commented 2 years ago

So I did as you suggested and training is running now on BTCV data. I didn't change anything except the input data and json file.

[Screenshot: training running on the BTCV data, 2022-11-29 19:14]
tangy5 commented 2 years ago

It works, that's good. Then the problem is likely related to the parameters. Since the error is raised in loss.backward(), you could check that the number of output channels matches the number of label classes in your own data, as the loss function expects.
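One concrete way to check this is sketched below (an editor-added example with placeholder paths and an assumed --out_channels value; it assumes nibabel is installed). It compares the label values in one of the new dataset's segmentation files against the number of output channels: the one-hot/scatter step indexes channels by label value, which appears to be exactly what the "index out of bounds" asserts in the log complain about.

# Hedged sketch: verify that label values fit inside the model's output channels.
import nibabel as nib
import numpy as np

out_channels = 14  # hypothetical: the value passed as --out_channels (BTCV default)
label_path = "/path/to/Task501_WORD-V0.1.0/labelsTr/some_label.nii.gz"  # placeholder path

labels = np.asarray(nib.load(label_path).dataobj)
unique = np.unique(labels)
print("unique label values:", unique)

# Every label must satisfy 0 <= label < out_channels; otherwise the one-hot/scatter
# step indexes a channel that does not exist and the CUDA assert fires.
assert unique.min() >= 0 and unique.max() < out_channels, (
    f"labels go up to {int(unique.max())}, but the model only has {out_channels} output channels"
)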

naayem commented 2 years ago

Yes, you're right. I didn't take the time to check all the parameters I should have changed for my new dataset. The 'RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR' is solved, thank you!