Closed: mbuckler closed this issue 7 years ago.
After poking around, it seems the CUDA error may be due to an issue with the CUDA or cuDNN version. What versions are you folks running? My Dockerfile currently uses CUDA 7.5 and cuDNN 5, but this could easily be changed.
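For reference, here is a minimal sketch of the kind of base-image line such a Dockerfile might use, assuming it builds on NVIDIA's public nvidia/cuda images (the tags shown are examples; swap in whichever CUDA/cuDNN combination you want to test):

```Dockerfile
# Sketch only: the CUDA/cuDNN versions are pinned by the base image tag.
# e.g. nvidia/cuda:7.5-cudnn5-devel for CUDA 7.5 + cuDNN 5,
# or nvidia/cuda:8.0-cudnn5-devel to try CUDA 8.0 instead.
FROM nvidia/cuda:7.5-cudnn5-devel
```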
@mbuckler did you find out why this is happening? I'm getting the same exact error, and this only happens when training across multiple GPUs. @bamos is there a specific requirement on the version of CUDA/CUDNN?
Hi @Supersak80, I was never able to fix this error when using multiple GPUs, unfortunately. Instead I moved to a single GPU as a workaround, so the good news is that one GPU works, but the bad news is that the error still persists with multiple GPUs.
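(For anyone hitting this, the workaround is just limiting training to one GPU; assuming the GPU count is controlled by an -nGPU option in training/opts.lua, that looks roughly like the line below.)

```bash
# Sketch only: -nGPU is assumed to be the opts.lua flag that sets the GPU count,
# and the data path is a placeholder.
th main.lua -nGPU 1 -data /path/to/aligned-casia
```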
@mbuckler take a look at this: https://github.com/facebook/fb.resnet.torch/issues/139
@bamos is there a specific requirement on the version of CUDA/CUDNN? Your input is greatly appreciated!
Context of the issue.
I've built an nvidia-docker image for training openface with CUDA and am encountering an issue when training/testing the network. My first test was to train with default parameters, which allowed the first epoch to go through, but then I encountered a memory error when testing after the first epoch.
After reducing the size of the test batch and training batch I am able to begin training and testing of the network, but it fails around epoch 22. There are two issues that seem worth mentioning. The first is that when testing after each epoch, the script successfully computes an accuracy but is followed by some form of parsing error (shown below). This parsing error doesn't stop training; what does end training is another error, also shown below.
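For concreteness, here is a rough sketch of how that kind of batch-size reduction is typically passed on the command line (the flag names -peoplePerBatch, -imagesPerPerson, and -testBatchSize are assumed from training/opts.lua and the values are only illustrative, so double-check them against your checkout):

```bash
# Sketch only: smaller-than-default batches as a workaround for the
# out-of-memory error at test time. Flag names, values, and the path are assumptions.
th main.lua \
  -data /path/to/aligned-casia \
  -peoplePerBatch 10 \
  -imagesPerPerson 10 \
  -testBatchSize 400
```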
Relevant side notes: Turning off testing does not seem to be an option; passing "false" to the -testing command option is rejected as incorrect usage.
I am currently using 3 GPUs and see the following warning for every training epoch: "warning: could not load nccl, falling back to default communication"
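As an aside, that warning only means the Torch NCCL bindings are not installed, so DataParallelTable falls back to its default communication path. A sketch of how the bindings are typically added is below (assuming NVIDIA's NCCL builds for your CUDA version; no claim that this fixes the CuDNN error itself):

```bash
# Sketch only: build NVIDIA NCCL from source, then install the Torch bindings.
# Follow the NCCL README for the exact build steps for your CUDA version.
git clone https://github.com/NVIDIA/nccl.git
cd nccl && make CUDA_HOME=/usr/local/cuda && make install
luarocks install nccl
```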
Expected behavior.
I expect the model to be tested without error and training to complete.
Actual behavior.
Error during testing:
Represent: 12900/13233
Represent: 13000/13233
Represent: 13100/13233
Represent: 13200/13233
Represent: 13233/13233
Loading embeddings.
Error which kills training:
Epoch: [22][8/200] Time 0.150 tripErr 1.90e-01
Epoch: [22][9/200] Time 0.203 tripErr 1.50e-01
Epoch: [22][10/200] Time 0.153 tripErr 1.93e-01
Epoch: [22][11/200] Time 0.150 tripErr 2.58e-01
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/threads/threads.lua:179: [thread 5 endcallback] /root/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 21 module of nn.Sequential:
In 2 module of nn.DepthConcat:
In 4 module of nn.Sequential:
/root/torch/install/share/lua/5.1/cudnn/init.lua:162: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/cudnn/init.lua:162: in function 'errcheck'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:139: in function 'createIODescriptors'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:187: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/DepthConcat.lua:34: in function 'updateOutput'
  /root/torch/install/share/lua/5.1/dpnn/Inception.lua:172: in function </root/torch/install/share/lua/5.1/dpnn/Inception.lua:170>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
  /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
  [C]: in function 'pcall'
  /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
  [string " local Queue = require 'threads.queue'..."]:13: in main chunk
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
  /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
  /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
  [C]: in function 'pcall'
  /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
  [string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:717: in function 'exec'
  ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:195: in function 'forward'
  /root/openface/training/train.lua:162: in function </root/openface/training/train.lua:140>
  [C]: in function 'xpcall'
  /root/torch/install/share/lua/5.1/threads/threads.lua:174: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  /root/openface/training/train.lua:66: in function 'train'
  main.lua:44: in main chunk
  [C]: in function 'dofile'
  /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
stack traceback:
  [C]: in function 'error'
  /root/torch/install/share/lua/5.1/threads/threads.lua:179: in function 'dojob'
  /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  /root/openface/training/train.lua:66: in function 'train'
  main.lua:44: in main chunk
  [C]: in function 'dofile'
  /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
Steps to reproduce.
I've included the Docker method to reproduce the error in the interest of clarity; to test the issue without Docker, simply skip steps 0-2.
0) Install docker and nvidia-docker
1) Pull the docker image
2) Run the docker image with a volume pointing to where your datasets are stored
3) Preprocess the CASIA dataset as shown here https://cmusatyalab.github.io/openface/training-new-models/
4) Preprocess the LFW dataset as shown here https://cmusatyalab.github.io/openface/models-and-accuracies/#running-the-lfw-experiment
5) Make an lfw directory in ~/openface/data
6) Copy pairs.txt to the directory you created (~/openface/data/lfw/).
7) cd to ~/openface/training
8) Delete any caches which have been previously created for CASIA or LFW. If this is the first time you are using the newly preprocessed CASIA and LFW datasets then no caches will exist and so you can skip this step.
9) Attempt to run training. Inform the script of where the preprocessed CASIA and LFW datasets are located via command-line arguments, as sketched below.
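Here is a sketch of what steps 8-9 look like in practice; the flag names (-data, -lfwDir, -nGPU) are assumed from training/opts.lua and the paths are placeholders, so treat this as illustrative rather than the literal command I ran:

```bash
# Sketch only: paths are placeholders and flag names are assumptions.
cd ~/openface/training
# Step 8: remove any previously generated dataset caches here
# (the cache file names depend on your setup, so they are not listed).
# Step 9: point training at the aligned CASIA and LFW directories.
th main.lua \
  -data /path/to/aligned-casia \
  -lfwDir /path/to/aligned-lfw \
  -nGPU 3
```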
OS and hardware information.