Closed: zackchen-lb closed this issue 2 years ago
Hi @ZEKAICHEN,
Thanks for your interest here. Could you please try testing with the MONAI Docker image to rule out environment issues? @ahatamiz could you please help double-check the issue?
Thanks in advance.
Hi @ZEKAICHEN
Would you please specify the hardware that is used in this instance?
Thanks
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|==============================================================================|
|  No running processes found                                                  |
+-----------------------------------------------------------------------------+
Hi there,
Thanks for the response! This is running on an AWS ec2 g4dn.xlarge instance with a single T4 GPU.
I tried adding CUDA_LAUNCH_BLOCKING=1 to debug (one way to set it is sketched after the error output below) and got something new:
Training (0 / 30000 Steps) (loss=4.19719): 17%|███▋ | 1/6 [00:09<00:46, 9.20s/it]
Traceback (most recent call last):
File "code/experiments/ssl/ssl_finetune_train.py", line 332, in <module>
main()
File "code/experiments/ssl/ssl_finetune_train.py", line 321, in main
global_step, dice_val_best, global_step_best = train(
File "code/experiments/ssl/ssl_finetune_train.py", line 263, in train
loss.backward()
File "/home/chenz53/miniconda/envs/imagingai/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/chenz53/miniconda/envs/imagingai/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 64, 48, 48, 48], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(64, 32, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f036000a990
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 4, 64, 48, 48, 48,
strideA = 7077888, 110592, 2304, 48, 1,
output: TensorDescriptor 0x7f036011db30
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 4, 32, 48, 48, 48,
strideA = 3538944, 110592, 2304, 48, 1,
weight: FilterDescriptor 0x7f0360034090
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 32, 64, 3, 3, 3,
Pointer addresses:
input: 0x7f0220000000
output: 0x7f0210000000
weight: 0x7f02675c3600
Additional pointer addresses:
grad_output: 0x7f0210000000
grad_weight: 0x7f02675c3600
Backward filter algorithm: 1
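For anyone who wants to reproduce the synchronous error reporting mentioned above, here is a minimal sketch of setting CUDA_LAUNCH_BLOCKING from inside the entry script; the variable can equally be exported in the shell before launching, and nothing here is specific to the tutorial code.
import os
# CUDA_LAUNCH_BLOCKING must be set before any CUDA work happens, so this goes
# at the very top of the script (or export CUDA_LAUNCH_BLOCKING=1 in the shell).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch  # imported only after the environment variable is in place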
Thanks @ZEKAICHEN for providing this information. I believe it is an out-of-memory (OOM) error. We used a Titan RTX with 24 GB of VRAM for this tutorial, but your GPU seems to have 16 GB, which may not be sufficient.
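For reference, a common way to fit the fine-tuning run onto a 16 GB card is to lower the batch size and run the forward/backward pass in mixed precision. Below is only a minimal sketch, not the tutorial's actual training loop; the UNETR arguments, batch size, and tensor shapes are illustrative assumptions.
import torch
from torch.cuda.amp import GradScaler, autocast
from monai.losses import DiceCELoss
from monai.networks.nets import UNETR

device = torch.device("cuda")
# Illustrative settings only; the real values come from the tutorial's config.
model = UNETR(in_channels=1, out_channels=14, img_size=(96, 96, 96), feature_size=16).to(device)
loss_function = DiceCELoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()

x = torch.randn(1, 1, 96, 96, 96, device=device)              # batch size 1 instead of 4
y = torch.randint(0, 14, (1, 1, 96, 96, 96), device=device)   # dummy labels

optimizer.zero_grad()
with autocast():  # mixed precision roughly halves activation memory
    logits = model(x)
    loss = loss_function(logits, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()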
Okay. I am going to try a larger instance and see. Thanks.
Update:
Tried on a p3.2xlarge with a single V100, and it works now. It is a bit odd, though, since both GPUs have the same amount of memory.
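If it helps narrow this down, a quick way to compare what PyTorch actually sees on each instance is the snippet below (a generic check, not from the tutorial; mem_get_info only exists in newer PyTorch releases, hence the guard):
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")
# mem_get_info is only available in more recent PyTorch versions, so check first.
if hasattr(torch.cuda, "mem_get_info"):
    free, total = torch.cuda.mem_get_info()
    print(f"{free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")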
Hi team,
I was trying to reproduce the self-supervised learning tutorial and rerun the BTCV fine-tuning of the UNETR model using the exact script you provided, and hit the error reported in this thread.
I have tried a lot of things but still cannot figure it out. Any thoughts on what is going on here? Many thanks!