NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

cuDNN error when running on a single card with a small image size #84

Closed: LutaoChu closed this issue 3 years ago

LutaoChu commented 3 years ago

I have 20 GB of memory on my card. It may look like I don't have enough GPU memory, but the crop size is only 128*128 and RMI loss is disabled. GPU memory usage keeps increasing until the error occurs.
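
For reference, memory growth like this can be tracked per iteration with PyTorch's CUDA memory statistics; the snippet below is a minimal sketch, with the logging point inside the training loop assumed rather than taken from train.py.

import torch

# Hypothetical logging point inside the training loop, e.g. right after optim.step():
allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
print(f"GPU memory: {allocated_mb:.0f} MB allocated, {peak_mb:.0f} MB peak")
torch.cuda.reset_max_memory_allocated()  # restart peak tracking for the next iteration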

shell log:

(pytorch1.4) xxx:~/semantic-segmentation$ python -m runx.runx scripts/train_cityscapes_local.yml -i
None
Torch version: 1.4, 1.4.0
Using regular batch norm
n scales [0.5]
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 0 ['val/munster', 'val/frankfurt', 'val/lindau']
mode val found 500 images
cn num_classes 19
cv split train 0 ['train/aachen', 'train/bochum', 'train/bremen', 'train/cologne', 'train/darmstadt', 'train/dusseldorf', 'train/erfurt', 'train/hamburg', 'train/hanover', 'train/jena', 'train/krefeld', 'train/monchengladbach', 'train/strasbourg', 'train/stuttgart', 'train/tubingen', 'train/ulm', 'train/weimar', 'train/zurich']
mode train found 2975 images
cn num_classes 19
Loading centroid file /home/chulutao/nvidia-assert/uniform_centroids/cityscapes_cv0_tile1024.json
Found 19 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866
cls 1 len 5184
cls 2 len 5678
cls 3 len 1312
cls 4 len 1723
cls 5 len 5656
cls 6 len 2769
cls 7 len 4860
cls 8 len 5388
cls 9 len 2440
cls 10 len 4722
cls 11 len 3719
cls 12 len 1239
cls 13 len 5075
cls 14 len 444
cls 15 len 348
cls 16 len 188
cls 17 len 575
cls 18 len 2238
Using Cross Entropy Loss
Using Cross Entropy Loss
Loading weights from: checkpoint=/home/chulutao/nvidia-assert/seg_weights/ocrnet.HRNet_industrious-chicken.pth
=> init weights from normal distribution
=> loading pretrained model /home/chulutao/nvidia-assert/seg_weights/hrnetv2_w48_imagenet_pretrained.pth
Trunk: hrnetv2
Model params = 72.1M
Skipped loading parameter module.ocr.cls_head.weight
Skipped loading parameter module.ocr.cls_head.bias
Skipped loading parameter module.ocr.aux_head.2.weight
Skipped loading parameter module.ocr.aux_head.2.bias
Skipped loading parameter module.scale_attn.conv0.weight
Skipped loading parameter module.scale_attn.bn0.weight
Skipped loading parameter module.scale_attn.bn0.bias
Skipped loading parameter module.scale_attn.bn0.running_mean
Skipped loading parameter module.scale_attn.bn0.running_var
Skipped loading parameter module.scale_attn.bn0.num_batches_tracked
Skipped loading parameter module.scale_attn.conv1.weight
Skipped loading parameter module.scale_attn.bn1.weight
Skipped loading parameter module.scale_attn.bn1.bias
Skipped loading parameter module.scale_attn.bn1.running_mean
Skipped loading parameter module.scale_attn.bn1.running_var
Skipped loading parameter module.scale_attn.bn1.num_batches_tracked
Skipped loading parameter module.scale_attn.conv2.weight
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866
cls 1 len 5184
cls 2 len 5678
cls 3 len 1312
cls 4 len 1723
cls 5 len 5656
cls 6 len 2769
cls 7 len 4860
cls 8 len 5388
cls 9 len 2440
cls 10 len 4722
cls 11 len 3719
cls 12 len 1239
cls 13 len 5075
cls 14 len 444
cls 15 len 348
cls 16 len 188
cls 17 len 575
cls 18 len 2238
Traceback (most recent call last):
  File "train.py", line 601, in <module>
    main()
  File "train.py", line 451, in main
    train(train_loader, net, optim, epoch)
  File "train.py", line 507, in train
    main_loss.backward()
  File "/home/chulutao/miniconda3/envs/pytorch1.4/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/chulutao/miniconda3/envs/pytorch1.4/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
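
Since the message points at a non-contiguous input, a minimal contiguity check just before the backward call is sketched below; the tensor names inputs and prediction are placeholders, not the actual variable names in train.py.

import torch

def report_non_contiguous(**tensors):
    """Print any tensor that is not contiguous in memory."""
    for name, t in tensors.items():
        if not t.is_contiguous():
            print(f"{name}: non-contiguous, shape={tuple(t.shape)}, strides={t.stride()}")

# Hypothetical usage just before main_loss.backward() in train.py, with
# placeholder names for the tensors feeding the loss:
#   report_non_contiguous(inputs=inputs, prediction=prediction)
# A frequently suggested (unconfirmed here) workaround is a contiguous copy:
#   prediction = prediction.contiguous()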

scripts/train_cityscapes_local.yml:

# Train cityscapes using Mapillary-pretrained weights
# Requires 32GB GPU
# Adjust nproc_per_node according to how many GPUs you have

# CMD: "python -m torch.distributed.launch --nproc_per_node=8 train.py"
CMD: "python train.py"

HPARAMS: [
  {
   dataset: cityscapes,
   cv: 0,
#    syncbn: true,
#    apex: true,
#    fp16: true,
#    crop_size: "1024,2048",
   crop_size: "128,128",
   bs_trn: 1,
   poly_exp: 2,
   lr: 5e-3,
#    rmi_loss: true,
   max_epoch: 175,
#    n_scales: "0.5,1.0,2.0",
   n_scales: "0.5",
   supervised_mscale_loss_wt: 0.05,
   snapshot: "ASSETS_PATH/seg_weights/ocrnet.HRNet_industrious-chicken.pth",
   arch: ocrnet.HRNet_Mscale,
   result_dir: LOGDIR,
   RUNX.TAG: '{arch}',
  },
]
ajtao commented 3 years ago

Hmm, can you try bs_trn: 2? Not sure what the problem is there.

LutaoChu commented 3 years ago

I tried bs_trn: 2 and encountered the same error message.

ajtao commented 3 years ago

I wonder if it's a PyTorch 1.4 bug ... https://github.com/pytorch/pytorch/issues/32395
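
If it is that bug, one way to narrow it down (a general diagnostic sketch, not a fix confirmed for this repo) is to rule cuDNN in or out before the model is built, or to try a newer PyTorch release.

import torch

# Disabling cuDNN before the model is built: if training then runs (more slowly),
# the failure comes from cuDNN kernel selection rather than GPU memory.
torch.backends.cudnn.enabled = False

# Less drastic: keep cuDNN but turn off convolution-algorithm autotuning.
torch.backends.cudnn.benchmark = False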

LutaoChu commented 3 years ago

Thanks! I have solved this problem.