deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

CUDNN error by using UNet #278

Closed PerloDaniele closed 3 years ago

PerloDaniele commented 3 years ago

CUDNN error by running UNet on docker dhealth/pylibs-toolkit:0.10.0-cudnn

 File "/usr/local/lib/python3.6/dist-packages/pyeddl-0.14.0-py3.6-linux-x86_64.egg/pyeddl/eddl.py", line 450, in train_batch
    return _eddl.train_batch(net, in_, out, indices)
RuntimeError: [CUDNN ERROR]: CUDNN_STATUS_BAD_PARAM (3) raised in cudnnGetConvolutionForwardWorkspaceSize at /usr/local/src/eddl/src/hardware/gpu/nn/gpu_conv.cu file | (check_cudnn)

Model:

def UNet(x, num_classes):
    depth = 64

    # encoder
    x = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x, depth, [3, 3], [1, 1], "same"), True))
    x = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x, depth, [3, 3], [1, 1], "same"), True))
    x2 = eddl.AveragePool(x, [2, 2], [2, 2])  # alternative: eddl.MaxPool(x, [2, 2], [2, 2])
    x2 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x2, 2*depth, [3, 3], [1, 1], "same"), True))
    x2 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x2, 2*depth, [3, 3], [1, 1], "same"), True))
    x3 = eddl.AveragePool(x2, [2, 2], [2, 2])  # alternative: eddl.MaxPool(x2, [2, 2], [2, 2])
    x3 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x3, 4*depth, [3, 3], [1, 1], "same"), True))
    x3 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x3, 4*depth, [3, 3], [1, 1], "same"), True))
    x4 = eddl.AveragePool(x3, [2, 2], [2, 2])  # alternative: eddl.MaxPool(x3, [2, 2], [2, 2])
    x4 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x4, 8*depth, [3, 3], [1, 1], "same"), True))
    x4 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x4, 8*depth, [3, 3], [1, 1], "same"), True))
    x5 = eddl.AveragePool(x4, [2, 2], [2, 2])  # alternative: eddl.MaxPool(x4, [2, 2], [2, 2])

    # middle conv
    x5 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x5, 16*depth, [3, 3], [1, 1], "same"), True))
    x5 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x5, 16*depth, [3, 3], [1, 1], "same"), True))

    # decoder
    x5 = eddl.Conv(
        eddl.UpSampling(x5, [2, 2]), 8*depth, [2, 2], [1, 1], "same"
    )
    x4 = eddl.Concat([x4, x5])
    x4 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x4, 8*depth, [3, 3], [1, 1], "same"), True))
    x4 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x4, 8*depth, [3, 3], [1, 1], "same"), True))
    x4 = eddl.Conv(
        eddl.UpSampling(x4, [2, 2]), 4*depth, [2, 2], [1, 1], "same"
    )
    x3 = eddl.Concat([x3, x4])
    x3 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x3, 4*depth, [3, 3], [1, 1], "same"), True))
    x3 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x3, 4*depth, [3, 3], [1, 1], "same"), True))
    x3 = eddl.Conv(
        eddl.UpSampling(x3, [2, 2]), 2*depth, [2, 2], [1, 1], "same"
    )
    x2 = eddl.Concat([x2, x3])
    x2 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x2, 2*depth, [3, 3], [1, 1], "same"), True))
    x2 = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x2, 2*depth, [3, 3], [1, 1], "same"), True))
    x2 = eddl.Conv(
        eddl.UpSampling(x2, [2, 2]), depth, [2, 2], [1, 1], "same"
    )
    x = eddl.Concat([x, x2])
    x = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x, depth, [3, 3], [1, 1], "same"), True))
    x = eddl.ReLu(eddl.BatchNormalization(eddl.Conv(x, depth, [3, 3], [1, 1], "same"), True))

    # final conv
    x = eddl.Conv(x, num_classes, [1, 1])

    return x

With dhealth/pylibs-toolkit:0.10.0-gpu (the non-cuDNN image), the same model trains fine on GPU.
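For context on where the error can come from: "same" padding with stride 1 needs a total of kernel_size - 1 padding per spatial dimension, which splits evenly only for odd kernels. A minimal sketch of that arithmetic in plain Python (this assumes EDDL uses the usual ceil-based "same" formula, which is an assumption, not a statement about its implementation):

```python
import math

def same_padding_1d(in_size, kernel, stride):
    """(pad_before, pad_after) for 'same' padding on one spatial dim,
    using the common TF-style formula: output = ceil(in / stride)."""
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    return pad_before, pad_after

# Odd 3x3 kernel, stride 1: symmetric padding -> fine for cuDNN.
print(same_padding_1d(256, 3, 1))  # (1, 1)

# Even 2x2 kernel, stride 1: asymmetric padding -> cuDNN cannot express it.
print(same_padding_1d(256, 2, 1))  # (0, 1)
```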

RParedesPalacios commented 3 years ago

Thanks, we will check it

adcastel commented 3 years ago

Hello, would you mind checking, at model initialization, whether there are any warnings about padding not being supported by cuDNN?

Thank you

Adrián Castelló


PerloDaniele commented 3 years ago

Yep, there are some warnings

Building model
CS with low memory setup
Selecting GPU device 0
EDDLL is running on GPU device 0, GeForce RTX 2080 Ti
CuBlas initialized on GPU device 0, GeForce RTX 2080 Ti
CuRand initialized on GPU device 0, GeForce RTX 2080 Ti
CuDNN initialized on GPU device 0, GeForce RTX 2080 Ti
Warning: asymmetric padding not supported by cuDNN... fixing ... potential shapes mismatch later
Warning: asymmetric padding not supported by cuDNN... fixing ... potential shapes mismatch later
Warning: asymmetric padding not supported by cuDNN... fixing ... potential shapes mismatch later
Warning: asymmetric padding not supported by cuDNN... fixing ... potential shapes mismatch later
-------------------------------------------------------------------------------
adcastel commented 3 years ago

OK. That is because asymmetric padding is not supported by cuDNN. For "same" padding, EDDL sometimes generates padding for the top but not the bottom (or for the left but not the right side) of the data, and the cuDNN backend fixes this by making the padding symmetric, which can cause a shape mismatch later. This will be addressed in the next release: execution will abort if asymmetric padding is detected.

For now, you should check the padding of your layers or add it manually so that it is symmetric.
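One way to "check the padding" ahead of time is to run the same-padding arithmetic over the conv configurations of the model above. The sketch below is plain Python with hypothetical helper names (it assumes the usual ceil-based "same" formula); it flags the four 2x2 up-convolutions in the decoder, which matches the four warnings in the log. Switching those kernels to 3x3, or padding the input manually, would keep the padding symmetric:

```python
import math

def is_asymmetric_same_padding(kernel, stride, in_size=256):
    """True if 'same' padding for this kernel/stride splits unevenly
    (odd total padding cannot be divided into equal before/after halves)."""
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    return pad_total % 2 == 1

# Kernel/stride pairs used by the UNet above: 3x3 stride-1 convs throughout,
# plus four 2x2 stride-1 convs after each UpSampling in the decoder.
layers = [("enc/dec conv 3x3", 3, 1)] + [("up-conv 2x2", 2, 1)] * 4

flagged = [name for name, k, s in layers if is_asymmetric_same_padding(k, s)]
print(flagged)  # only the four 2x2 up-convolutions are flagged
```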

Adrián Castelló


salvacarrion commented 3 years ago

Fixed.