Issues with training the network on GPU

aminzabardast commented 1 year ago

There are issues when I try to train the network on GPU.

After adding torch.autograd.set_detect_anomaly(True) to Training.py, the following error appears:

/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py:2494: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
/tmp/pip-req-build-ocx5vxk7/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
  File "Training.py", line 219, in <module>
    train(train_loader, model, optimizer, epoch, save_path, writer)
  File "Training.py", line 45, in train
    preds = model(images)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/lib/PraNet_Res2Net.py", line 134, in forward
    x2 = self.resnet.layer2(x1)     # bs, 512, 44, 44
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/lib/Res2Net_v1b.py", line 82, in forward
    out = self.conv3(out)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)

Traceback (most recent call last):
  File "Training.py", line 219, in <module>
    train(train_loader, model, optimizer, epoch, save_path, writer)
  File "Training.py", line 51, in train
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 0th output.
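
The RuntimeError above reports NaN values coming out of a backward function, but it does not say whether the forward values were already non-finite. One possible complementary check, sketched here under assumptions, is to test a few scalars for NaN/inf right before loss.backward(). The helper name first_nonfinite and the sample values are hypothetical and are not taken from Training.py.

```python
import math


def first_nonfinite(named_values):
    """Return the name of the first NaN/inf entry, or None if all are finite."""
    for name, value in named_values.items():
        if not math.isfinite(value):
            return name
    return None


# Hypothetical values standing in for loss.item() and similar scalars
# collected right before loss.backward() in a training step.
step_values = {"loss": 0.4317, "pred_max": float("inf")}
bad = first_nonfinite(step_values)
if bad is not None:
    print(f"non-finite value in '{bad}' before the backward pass")
```

If every checked value is finite and the backward pass still fails, the problem is more likely in the gradient computation itself than in the forward outputs.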

GewelsJI commented 1 year ago

Hi, @aminzabardast

I don't know which PyTorch version you used. Please make sure the correct version is installed. If it still does not work, please provide more details.
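
A minimal sketch of how such version details could be collected in one place, assuming a standard PyTorch install. The function name environment_report is hypothetical. The PyTorch import is wrapped in try/except so the script also runs in an interpreter without PyTorch.

```python
import platform


def environment_report():
    """Collect version details that are commonly asked for in GPU bug reports."""
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this interpreter
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
    report["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this both inside the container and on the host would show whether the two environments actually differ.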

aminzabardast commented 1 year ago

Hi, @GewelsJI. Thank you for the quick response.

I matched all the requirements by running everything in a Docker container. I used this image on Docker Hub.

My Dockerfile:

FROM pytorch/pytorch:1.3-cuda10.1-cudnn7-runtime
LABEL authors="amin"

RUN conda install pytorch=1.3.1 torchvision=0.4.2 cudatoolkit=10.0 --yes
RUN pip install opencv-python==3.4.2.17 tensorboardX==2.0

Training on CPU (although slow) runs correctly, but training on GPU has this issue. I forked the repository and all my changes are in there.

GewelsJI commented 1 year ago

Hi, @aminzabardast

Could you verify it in a local environment? I have not tried it in a Docker image.

aminzabardast commented 1 year ago

@GewelsJI Unfortunately, matching the exact CUDA/cuDNN requirements is a challenge. But conceptually, there should be no difference between what runs in a Docker container and a local execution.

GewelsJI commented 1 year ago

@aminzabardast Agreed. But I have no relevant Docker experience to offer. Alternatively, you can try my latest project: https://github.com/GewelsJI/DGNet/tree/main/lib_pytorch
