CUDNN_STATUS_BAD_PARAM (3) raised in cudnnConvolutionForward

mdrio commented 2 years ago

Using the Docker image dhealth/pylibs-toolkit:0.12.2-cudnn with a VGG16 network, I encountered the following error:

[CUDNN ERROR]: CUDNN_STATUS_BAD_PARAM (3) raised in cudnnConvolutionForward at /usr/local/src/eddl/src/hardware/gpu/nn/gpu_conv.cu file: 219 line. | (check_cudnn)

Here is the code to reproduce the error:

import numpy as np
import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor

def create_net():
    in_size = [256, 256]
    num_classes = 2
    in_ = eddl.Input([3, in_size[0], in_size[1]])
    out = create_VGG16(in_, num_classes)
    net = eddl.Model([in_], [out])
    return net

def create_VGG16(in_layer, num_classes, seed=1234, init=eddl.HeNormal):
    x = in_layer
    x = eddl.ReLu(init(eddl.Conv(x, 64, [3, 3]), seed))
    x = eddl.MaxPool(eddl.ReLu(init(eddl.Conv(x, 64, [3, 3]), seed)), [2, 2], [2, 2])
    x = eddl.ReLu(init(eddl.Conv(x, 128, [3, 3]), seed))
    x = eddl.MaxPool(eddl.ReLu(init(eddl.Conv(x, 128, [3, 3]), seed)), [2, 2], [2, 2])
    x = eddl.ReLu(init(eddl.Conv(x, 256, [3, 3]), seed))
    x = eddl.ReLu(init(eddl.Conv(x, 256, [3, 3]), seed))
    x = eddl.MaxPool(eddl.ReLu(init(eddl.Conv(x, 256, [3, 3]), seed)), [2, 2], [2, 2])
    x = eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed))
    x = eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed))
    x = eddl.MaxPool(eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed)), [2, 2], [2, 2])
    x = eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed))
    x = eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed))
    x = eddl.MaxPool(eddl.ReLu(init(eddl.Conv(x, 512, [3, 3]), seed)), [2, 2], [2, 2])
    x = eddl.Reshape(x, [-1])
    x = eddl.ReLu(init(eddl.Dense(x, 256), seed))
    x = eddl.Softmax(eddl.Dense(x, num_classes))
    return x

net = create_net()
eddl.build(
    net,
    eddl.rmsprop(0.00001),
    ["soft_cross_entropy"],
    ["categorical_accuracy"],
    eddl.CS_GPU([1], mem="low_mem"),
)

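# Two input batches with different batch sizes: 13 and 19.
# np.empty leaves the contents uninitialized; only the shapes matter here.
t1 = Tensor.fromarray(np.empty((13, 3, 256, 256), dtype="uint8") / 255)
t2 = Tensor.fromarray(np.empty((19, 3, 256, 256), dtype="uint8") / 255)

# The second call uses a larger batch than the first and raises the error.
eddl.predict(net, [t1])
eddl.predict(net, [t2])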

Here is the log produced in my environment:

Generating Random Table
CS with low memory setup
Building model
Selecting GPU device 0
EDDLL is running on GPU device 0, NVIDIA GeForce RTX 2080 Ti
CuBlas initialized on GPU device 0, NVIDIA GeForce RTX 2080 Ti
CuRand initialized on GPU device 0, NVIDIA GeForce RTX 2080 Ti
CuDNN initialized on GPU device 0, NVIDIA GeForce RTX 2080 Ti
Traceback (most recent call last):
  File "test-cudnn/error-cudnn.py", line 53, in <module>
    eddl.predict(net, [t2])
  File "/usr/local/lib/python3.6/dist-packages/pyeddl-1.1.0-py3.6-linux-x86_64.egg/pyeddl/eddl.py", line 420, in predict
    return _eddl.predict(m, in_)
RuntimeError: [CUDNN ERROR]: CUDNN_STATUS_BAD_PARAM (3) raised in cudnnConvolutionForward at /usr/local/src/eddl/src/hardware/gpu/nn/gpu_conv.cu file: 219 line. | (check_cudnn)

Note that the error does not occur if the order of the tensors is swapped, i.e. if the last two lines are changed to:

eddl.predict(net, [t2])
eddl.predict(net, [t1])

No error occurs when running on the CPU, or on the GPU without cuDNN.
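
Since the failure only shows up when a larger batch follows a smaller one, a possible stopgap is to run the largest batch first. The sketch below is untested against EDDL. `predict_largest_first` and `predict_fn` are hypothetical names, not part of pyeddl, and the helper assumes each batch supports `len()` (for example a NumPy array before conversion with `Tensor.fromarray`).

```python
def predict_largest_first(predict_fn, batches):
    """Call predict_fn on every batch, largest batch first.

    Results are returned in the original order of `batches`.
    `predict_fn` is a hypothetical callable that wraps the real
    prediction call; it is not part of pyeddl.
    """
    # Indices of the batches, sorted by decreasing batch size
    # (len() gives the size of the first dimension).
    order = sorted(range(len(batches)),
                   key=lambda i: len(batches[i]),
                   reverse=True)

    results = [None] * len(batches)
    for i in order:
        # Store each result at the position of its batch in the input list.
        results[i] = predict_fn(batches[i])
    return results
```

With the script above, this might be called as `predict_largest_first(lambda a: eddl.predict(net, [Tensor.fromarray(a)]), [a1, a2])`, where `a1` and `a2` stand for the NumPy arrays before conversion. Whether this avoids the error in every case is an assumption based only on the observation reported here.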

chavicoski commented 2 years ago

Hi,

We are working on it. We are trying to find out what is happening.

Álvaro

mdrio commented 2 years ago

Ok, thanks.

chavicoski commented 2 years ago

Hi,

We found and fixed the problem: with cuDNN enabled, an error occurred when the batch size increased between calls. The fix will be included in the next release.

Thanks!
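
The thread does not show the actual fix. As a purely illustrative sketch of the general pattern, state that depends on the batch size has to be rebuilt whenever the batch size changes (in cuDNN, a tensor descriptor includes the batch dimension). `BatchState` below is a hypothetical toy class, not EDDL code, and it makes no claim about how EDDL handles this internally.

```python
class BatchState:
    """Toy stand-in for state that depends on the batch size."""

    def __init__(self):
        self.batch_size = None  # batch size the state was built for
        self.rebuilds = 0       # how many times the state was (re)built

    def ensure(self, batch_size):
        # Rebuild whenever the batch size differs from the cached one,
        # regardless of whether it grew or shrank.
        if batch_size != self.batch_size:
            self.batch_size = batch_size
            self.rebuilds += 1
        return self.batch_size
```

For the batch sizes in the report, calling `ensure(13)` and then `ensure(19)` rebuilds the state twice, while a repeated `ensure(19)` reuses it.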

deephealthproject / eddl

CUDNN_STATUS_BAD_PARAM (3) raised in cudnnConvolutionForward #301