Closed: mbuckler closed this issue 7 years ago.
After poking around, it seems the CUDA error may be due to an issue with the CUDA or cuDNN version. What versions are you folks running? My Dockerfile currently uses CUDA 7.5 and cuDNN 5, but this could easily be changed.
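For reference, here is a minimal sketch of the kind of base-image line such a Dockerfile might use, assuming it builds on NVIDIA's public nvidia/cuda images (the tags shown are examples; swap in whichever CUDA/cuDNN combination you want to test):

```Dockerfile
# Sketch only: the CUDA/cuDNN versions are pinned by the base image tag.
# e.g. nvidia/cuda:7.5-cudnn5-devel for CUDA 7.5 + cuDNN 5,
# or nvidia/cuda:8.0-cudnn5-devel to try CUDA 8.0 instead.
FROM nvidia/cuda:7.5-cudnn5-devel
```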
@mbuckler did you find out why this is happening? I'm getting the same exact error, and this only happens when training across multiple GPUs. @bamos is there a specific requirement on the version of CUDA/CUDNN?
Hi @Supersak80, I was never able to fix this error when using multiple GPUs, unfortunately. Instead I moved to a single GPU as a workaround, so the good news is that one GPU works, but the bad news is that the error still persists with multiple GPUs.
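(For anyone hitting this, the workaround is just limiting training to one GPU; assuming the GPU count is controlled by an -nGPU option in training/opts.lua, that looks roughly like the line below.)

```bash
# Sketch only: -nGPU is assumed to be the opts.lua flag that sets the GPU count,
# and the data path is a placeholder.
th main.lua -nGPU 1 -data /path/to/aligned-casia
```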
@mbuckler take a look at this: https://github.com/facebook/fb.resnet.torch/issues/139
@bamos is there a specific requirement on the version of CUDA/CUDNN? Your input is greatly appreciated!
Context of the issue.
I've built an nvidia-docker image for training openface with CUDA and am encountering an issue when training/testing the network. My first test was to train with default parameters, which allowed the first epoch to go through, but then I encountered a memory error when testing after the first epoch.
After reducing the size of the test batch and training batch I am able to begin training and testing of the network, but it fails around epoch 22. There are two issues that seem worth mentioning. The first is that when testing after each epoch, the script successfully computes an accuracy but is followed by some form of parsing error (shown below). This parsing error doesn't stop training; what does end training is another error, also shown below.
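For concreteness, here is a rough sketch of how that kind of batch-size reduction is typically passed on the command line (the flag names -peoplePerBatch, -imagesPerPerson, and -testBatchSize are assumed from training/opts.lua and the values are only illustrative, so double-check them against your checkout):

```bash
# Sketch only: smaller-than-default batches as a workaround for the
# out-of-memory error at test time. Flag names, values, and the path are assumptions.
th main.lua \
  -data /path/to/aligned-casia \
  -peoplePerBatch 10 \
  -imagesPerPerson 10 \
  -testBatchSize 400
```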
Relevant side notes: Turning off testing does not seem to be an option; passing "false" to the -testing command option is rejected as incorrect usage.
I am currently using 3 GPUs and see the following warning for every training epoch: "warning: could not load nccl, falling back to default communication"
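As an aside, that warning only means the Torch NCCL bindings are not installed, so DataParallelTable falls back to its default communication path. A sketch of how the bindings are typically added is below (assuming NVIDIA's NCCL builds for your CUDA version; no claim that this fixes the CuDNN error itself):

```bash
# Sketch only: build NVIDIA NCCL from source, then install the Torch bindings.
# Follow the NCCL README for the exact build steps for your CUDA version.
git clone https://github.com/NVIDIA/nccl.git
cd nccl && make CUDA_HOME=/usr/local/cuda && make install
luarocks install nccl
```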
Expected behavior.
I expect the model to be tested without error and training to complete.
Actual behavior.
Error during testing:
Represent: 12900/13233
Represent: 13000/13233
Represent: 13100/13233
Represent: 13200/13233
Represent: 13233/13233
Loading embeddings.
Error which kills training:
Epoch: [22][8/200] Time 0.150 tripErr 1.90e-01
Epoch: [22][9/200] Time 0.203 tripErr 1.50e-01
Epoch: [22][10/200] Time 0.153 tripErr 1.93e-01
Epoch: [22][11/200] Time 0.150 tripErr 2.58e-01
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/threads/threads.lua:179: [thread 5 endcallback] /root/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 21 module of nn.Sequential:
In 2 module of nn.DepthConcat:
In 4 module of nn.Sequential:
/root/torch/install/share/lua/5.1/cudnn/init.lua:162: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/cudnn/init.lua:162: in function 'errcheck'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:139: in function 'createIODescriptors'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:187: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/DepthConcat.lua:34: in function 'updateOutput'
  /root/torch/install/share/lua/5.1/dpnn/Inception.lua:172: in function </root/torch/install/share/lua/5.1/dpnn/Inception.lua:170>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
  /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
  [C]: in function 'pcall'
  /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
  [string " local Queue = require 'threads.queue'..."]:13: in main chunk
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
  /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
  [C]: in function 'pcall'
  /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
  [string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:717: in function 'exec'
  ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:195: in function 'forward'
  /root/openface/training/train.lua:162: in function </root/openface/training/train.lua:140>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:174: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  /root/openface/training/train.lua:66: in function 'train'
  main.lua:44: in main chunk
  [C]: in function 'dofile'
  /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/threads/threads.lua:179: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  /root/openface/training/train.lua:66: in function 'train'
  main.lua:44: in main chunk
  [C]: in function 'dofile'
  /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
Steps to reproduce.
I've included the Docker method to reproduce the error in the interest of clarity; to test the issue without Docker, simply skip steps 0-2.
0) Install docker and nvidia-docker
1) Pull the docker image
2) Run the docker image with a volume pointing to where your datasets are stored
3) Preprocess the CASIA dataset as shown here https://cmusatyalab.github.io/openface/training-new-models/
4) Preprocess the LFW dataset as shown here https://cmusatyalab.github.io/openface/models-and-accuracies/#running-the-lfw-experiment
5) Make an lfw directory in ~/openface/data
6) Copy pairs.txt to the directory you created (~/openface/data/lfw/).
7) cd to ~/openface/training
8) Delete any caches which have been previously created for CASIA or LFW. If this is the first time you are using the newly preprocessed CASIA and LFW datasets then no caches will exist and so you can skip this step.
9) Attempt to run training. Inform the script of where the preprocessed CASIA and LFW datasets are located via command-line arguments, as sketched below.
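Here is a sketch of what steps 8-9 look like in practice; the flag names (-data, -lfwDir, -nGPU) are assumed from training/opts.lua and the paths are placeholders, so treat this as illustrative rather than the literal command I ran:

```bash
# Sketch only: paths are placeholders and flag names are assumptions.
cd ~/openface/training
# Step 8: remove any previously generated dataset caches here
# (the cache file names depend on your setup, so they are not listed).
# Step 9: point training at the aligned CASIA and LFW directories.
th main.lua \
  -data /path/to/aligned-casia \
  -lfwDir /path/to/aligned-lfw \
  -nGPU 3
```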
OS and hardware information.