Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

RuntimeError: CUDA error: an illegal memory access was encountered #513

Closed · zackchen-lb closed this issue 2 years ago

zackchen-lb commented 2 years ago

Hi team,

I was trying to reproduce the self-supervised learning tutorial and reran the BTCV fine-tuning of the UNETR model using the exact script you provided. The following error was reported:

MONAI version: 0.8.0
Numpy version: 1.21.2
Pytorch version: 1.10.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 714d00dffe6653e21260160666c4c201ab66511b

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 8.4.0
Tensorboard version: 2.7.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.11.2
tqdm version: 4.62.3
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.0
pandas version: 1.3.5
einops version: 0.3.2
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: 1.22.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
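
(For reference, this version report appears to be the output of MONAI's built-in monai.config.print_config helper; it can be regenerated when sharing environment details with:)

from monai.config import print_config

# Prints MONAI/PyTorch/NumPy versions plus the status of optional dependencies.
print_config()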

Loading dataset: 100%|████████████████████████████████████████████████| 6/6 [00:12<00:00,  2.04s/it]
Loading dataset: 100%|████████████████████████████████████████████████| 6/6 [00:14<00:00,  2.46s/it]
image shape: torch.Size([1, 314, 214, 234]), label shape: torch.Size([1, 314, 214, 234])
No weights were loaded, all weights being used are randomly initialized!
Training (0 / 30000 Steps) (loss=3.45243):  17%|███▋                  | 1/6 [00:09<00:45,  9.09s/it]
Traceback (most recent call last):
  File "code/experiments/ssl/ssl_finetune_train.py", line 332, in <module>
    main()
  File "code/experiments/ssl/ssl_finetune_train.py", line 321, in main
    global_step, dice_val_best, global_step_best = train(
  File "code/experiments/ssl/ssl_finetune_train.py", line 264, in train
    epoch_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I have tried a number of things but still cannot figure it out. Any thoughts on what is going on here? Many thanks!
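
(One way to get a more accurate stack trace, as the message suggests, is to force synchronous kernel launches. A minimal sketch, assuming the variable is set at the very top of ssl_finetune_train.py before any CUDA work happens; exporting CUDA_LAUNCH_BLOCKING=1 in the shell before launching works equally well:)

import os

# Must be set before the CUDA context is created, otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the variable is set so kernels launch synchronously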

Nic-Ma commented 2 years ago

Hi @ZEKAICHEN,

Thanks for your interest. Could you please test with the MONAI Docker image to rule out environment issues? @ahatamiz, could you please help double-check this issue?

Thanks in advance.

ahatamiz commented 2 years ago

Hi @ZEKAICHEN

Could you please specify the hardware used in this instance?

Thanks

zackchen-lb commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

zackchen-lb commented 2 years ago

Hi there,

Thanks for the response! This is running on an AWS EC2 g4dn.xlarge instance with a single T4 GPU.

zackchen-lb commented 2 years ago

I tried adding CUDA_LAUNCH_BLOCKING=1 to debug and got something new:

Training (0 / 30000 Steps) (loss=4.19719):  17%|███▋                  | 1/6 [00:09<00:46,  9.20s/it]
Traceback (most recent call last):
  File "code/experiments/ssl/ssl_finetune_train.py", line 332, in <module>
    main()
  File "code/experiments/ssl/ssl_finetune_train.py", line 321, in main
    global_step, dice_val_best, global_step_best = train(
  File "code/experiments/ssl/ssl_finetune_train.py", line 263, in train
    loss.backward()
  File "/home/chenz53/miniconda/envs/imagingai/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/chenz53/miniconda/envs/imagingai/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 64, 48, 48, 48], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(64, 32, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 1]
    stride = [1, 1, 1]
    dilation = [1, 1, 1]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f036000a990
    type = CUDNN_DATA_FLOAT
    nbDims = 5
    dimA = 4, 64, 48, 48, 48,
    strideA = 7077888, 110592, 2304, 48, 1,
output: TensorDescriptor 0x7f036011db30
    type = CUDNN_DATA_FLOAT
    nbDims = 5
    dimA = 4, 32, 48, 48, 48,
    strideA = 3538944, 110592, 2304, 48, 1,
weight: FilterDescriptor 0x7f0360034090
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 5
    dimA = 32, 64, 3, 3, 3,
Pointer addresses:
    input: 0x7f0220000000
    output: 0x7f0210000000
    weight: 0x7f02675c3600
Additional pointer addresses:
    grad_output: 0x7f0210000000
    grad_weight: 0x7f02675c3600
Backward filter algorithm: 1
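
(To check whether this failure is memory related rather than a genuine cuDNN bug, the standalone snippet from the message above can be rerun with peak-memory reporting; a rough sketch, with shapes copied from the error message rather than from the training script:)

import torch

torch.backends.cudnn.benchmark = True
data = torch.randn([4, 64, 48, 48, 48], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(64, 32, kernel_size=3, padding=1).cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

# Peak memory used by this isolated forward/backward pass, in GiB.
print(f"max allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
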
ahatamiz commented 2 years ago

Thanks @ZEKAICHEN for providing this information. I believe this is an out-of-memory (OOM) error. We used a Titan RTX with 24 GB of VRAM for this tutorial, but your GPU seems to have 16 GB, which may not be sufficient.
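
(If memory is indeed the limit, two common mitigations are a smaller batch size and mixed-precision training. The sketch below is illustrative only; the UNETR arguments, loss, and optimizer are placeholders rather than the tutorial's exact settings, which live in ssl_finetune_train.py:)

import torch
from monai.losses import DiceCELoss
from monai.networks.nets import UNETR

# Placeholder configuration for illustration only.
model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96)).cuda()
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # fp16 activations roughly halve activation memory
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()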

zackchen-lb commented 2 years ago

Okay. I am going to try an instance with a larger-memory GPU and see. Thanks.

zackchen-lb commented 2 years ago

Update:

I tried this on a p3.2xlarge with a single V100 and it works now. It is a bit odd, though, since both GPUs have the same memory size.
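
(For anyone comparing the two setups: a quick way to confirm what each device actually reports to PyTorch is the snippet below; both the T4 and the 16 GB V100 should print roughly the same total.)

import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB total")
print(f"currently allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"currently reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")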