eriklindernoren / PyTorch-GAN

PyTorch implementations of Generative Adversarial Networks.
MIT License

CUDNN_STATUS_INTERNAL_ERROR when running train. #134

Open bill52547 opened 3 years ago

bill52547 commented 3 years ago

Hi there, I am new to CycleGAN and would like to try the samples, such as horse2zebra. I ran the training on my single-GPU PC and hit a cuDNN error. The snippet is shown below.

import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 256, 66, 66], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x55ae78135d50
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 256, 66, 66,
    strideA = 1115136, 4356, 66, 1,
output: TensorDescriptor 0x55ae76eb1540
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 256, 64, 64,
    strideA = 1048576, 4096, 64, 1,
weight: FilterDescriptor 0x55ae795a3be0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 256, 256, 3, 3,
Pointer addresses:
    input: 0x7fa720000000
    output: 0x7fa710000000
    weight: 0x7fa742120000
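
As a sanity check on the dump above, the reported shapes and strides can be reproduced by hand with the standard convolution output-size formula and the usual contiguous NCHW stride layout. This is a minimal sketch only; the helper names `conv_out_size` and `contiguous_strides` are made up for illustration and are not part of PyTorch or cuDNN.

```python
def conv_out_size(size, kernel, padding=0, stride=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def contiguous_strides(dims):
    """Element strides of a densely packed tensor with the given dims."""
    strides = []
    step = 1
    for d in reversed(dims):
        strides.append(step)
        step *= d
    return list(reversed(strides))


# Values taken from the snippet: 66x66 input, 3x3 kernel, no padding, stride 1.
in_dims = [1, 256, 66, 66]
out_hw = conv_out_size(66, kernel=3)
out_dims = [1, 256, out_hw, out_hw]

print("output dims:   ", out_dims)
print("input strides: ", contiguous_strides(in_dims))
print("output strides:", contiguous_strides(out_dims))
```

The computed values agree with the `dimA` and `strideA` entries in the descriptors, which suggests the tensor descriptions themselves are consistent.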

My PC runs Ubuntu 18.04 LTS with a 2080 Ti. The CUDA version is 11.0, installed from the run file. cuDNN is v8.0.5, the build compatible with CUDA 11.0 and Ubuntu 18.04, from "cuDNN Developer Library for Ubuntu18.04 x86_64 (Deb)". I installed PyTorch using the command from its own website:

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
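
To confirm what PyTorch itself sees at runtime, as opposed to what is installed system-wide, the versions can be printed from Python. The sketch below is only a diagnostic aid, not a fix; the function name `collect_env` is hypothetical, and it assumes device index 0 is the GPU in question.

```python
import importlib


def collect_env():
    """Return the versions PyTorch reports, or {'torch': None} if it is missing."""
    info = {"torch": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info

    info["torch"] = torch.__version__
    # CUDA version the wheel was built against (may differ from the system toolkit).
    info["cuda_build"] = torch.version.cuda
    # cuDNN version PyTorch loads, as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # Assumption: the GPU of interest is device 0.
        info["device"] = torch.cuda.get_device_name(0)
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```

If the values printed here differ from the system-wide CUDA and cuDNN versions listed above, that is worth including in the report.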

I uninstalled and reinstalled CUDA, cuDNN, and PyTorch several times. Every time, I get the same CUDNN_STATUS_INTERNAL_ERROR.

Does anyone have any idea how to solve it? Thanks.

Minghao