training error---cudnnFindConvolutionForwardAlgorithm failed

ChangshiFan commented 6 years ago

  @XingangPan Hello! When I use the CULane dataset to train the SCNN model, some errors happen. The errors are below.

=> Training epoch # 1

cudnnFindConvolutionForwardAlgorithm failed: 2 convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA12,128,72,200 -filtA256,128,3,3 12,256,72,200 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
/home/pro/torch/install/bin/luajit: /home/pro/torch/install/share/lua/5.1/nn/Container.lua:67: In 15 module of nn.Sequential:
/home/pro/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes: convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA12,128,72,200 -filtA256,128,3,3 12,256,72,200 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
stack traceback:
    [C]: in function 'error'
    /home/pro/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:190: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
    [C]: in function 'xpcall'
    /home/pro/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/pro/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train.lua:69: in function 'train'
    main.lua:51: in main chunk
    [C]: in function 'dofile'
    ...pro/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/pico/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /home/pico/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train.lua:69: in function 'train'
    main.lua:51: in main chunk
    [C]: in function 'dofile'
    ...pico/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

Here are my related settings: nGpu 1, nthread 2, CUDA version 8.0, cuDNN 6.0.
I have tried changing the batchsize, but it did not work. I don't know what caused it. I'd appreciate it if you could help me.
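For reference, the shapes in the error message can be turned into rough memory figures. The sketch below assumes that the `hash=` string lists the input size after `-dimA`, the filter size after `-filtA`, and the output size as the bare tuple that follows, each in (N, C, H, W) order, and that `CUDNN_DATA_FLOAT` means 4 bytes per element. The helper name `tensor_mib` is made up for this illustration.

```python
import math

BYTES_PER_FLOAT = 4  # assumption: CUDNN_DATA_FLOAT is 32-bit

def tensor_mib(shape):
    """Size of a dense float tensor with the given shape, in MiB."""
    return math.prod(shape) * BYTES_PER_FLOAT / 2**20

# Shapes copied from the error message (interpretation is an assumption).
input_shape = (12, 128, 72, 200)    # after -dimA
filter_shape = (256, 128, 3, 3)     # after -filtA
output_shape = (12, 256, 72, 200)   # bare tuple after the filter size

for name, shape in [("input", input_shape),
                    ("filter", filter_shape),
                    ("output", output_shape)]:
    print(f"{name:6s} {shape}: {tensor_mib(shape):.3f} MiB")
```

Under that reading, the leading 12 is the batch dimension, so all 12 images appear to be placed on the single GPU (nGpu 1). This one layer's input and output already take about 84 MiB and 169 MiB, and a full training pass also keeps activations and gradients for the other layers plus whatever workspace cuDNN requests, so the total is much larger than these two tensors.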

XingangPan commented 6 years ago

According to some other issues (https://github.com/allenai/XNOR-Net/issues/22, https://github.com/soumith/dcgan.torch/issues/67), this might be due to insufficient GPU memory. How much GPU memory do you have? In my case a 12G GPU (Titan X) can sustain a batchsize of 3. That's why I train the model using 4 GPUs with a batchsize of 12.

ChangshiFan commented 6 years ago

Thanks for your reply. My GPU's memory is 6G (GTX 980Ti). When I change batchsize to a smaller number (for example, 6), it shows "... ... ...out of memory ... ... ...". According to your setting, my batchsize is too big. What number do you think I should set? Also, I'm not sure whether my computer can train the model well.

ChangshiFan commented 6 years ago

This problem has been solved. I changed the batchsize to 1. Thanks for your reply.
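The numbers in this thread fit a rough rule of thumb, sketched below. It assumes that the usable batch size per GPU scales about linearly with GPU memory, which is only an approximation because model weights and cuDNN workspace do not shrink with the batch. The function names are hypothetical and not part of the SCNN code.

```python
def per_gpu_batch(total_batch, n_gpus):
    """Images each GPU processes when the batch is split evenly."""
    return total_batch // n_gpus

def estimate_batch(ref_batch, ref_mem_gb, mem_gb):
    """Scale a known-good per-GPU batch size linearly with memory.

    A crude starting point only; never returns less than 1.
    """
    return max(1, int(ref_batch * mem_gb / ref_mem_gb))

# Reference point from this thread: batchsize 12 on 4 GPUs with 12G each.
ref = per_gpu_batch(12, 4)
print("per-GPU batch on the 12G cards:", ref)

# Starting guess for a single 6G card.
print("estimate for a 6G card:", estimate_batch(ref, 12, 6))
```

The estimate is a starting point rather than a guarantee: if training still runs out of memory, reduce the batch size further.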

XingangPan / SCNN

training error---cudnnFindConvolutionForwardAlgorithm failed #51