cmusatyalab / openface

Face recognition with deep neural networks.
http://cmusatyalab.github.io/openface/
Apache License 2.0

Bad param training error during training and parsing error during testing #212

Closed: mbuckler closed this issue 7 years ago

mbuckler commented 7 years ago

Context of the issue.

I've built an nvidia-docker image for training OpenFace with CUDA and am encountering an issue when training/testing the network. My first test was to train with the default parameters, which let the first epoch go through, but I then hit a memory error during testing after the first epoch.

After reducing the test and training batch sizes I am able to start training and testing the network, but it fails around epoch 22. Two issues seem worth mentioning. The first is that when testing after each epoch the script successfully computes an accuracy, but this is followed by some form of parsing error (shown below). That parsing error doesn't stop training; what does end training is a second error, also shown below.

Relevant side note: turning off testing does not seem to be an option, since passing "false" to the -testing command option is rejected as incorrect usage.
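For reference, here is a minimal sketch of how a boolean option like -testing is typically declared with torch.CmdLine. This is illustrative only; I have not checked how training/opts.lua actually declares it, and the default shown is an assumption. Boolean options in torch.CmdLine take no value on the command line (the flag's presence toggles the default), which would explain why an explicit "false" is rejected as bad usage:

    -- Illustrative sketch only, not the actual OpenFace opts.lua.
    -- torch.CmdLine boolean options are toggled by the flag's presence and
    -- take no value, so `-testing false` leaves "false" as a stray argument
    -- and triggers the usage error.
    require 'torch'

    local cmd = torch.CmdLine()
    cmd:option('-testing', true, 'run the LFW test after every epoch')  -- default is an assumption
    local opt = cmd:parse(arg or {})
    print('testing enabled:', opt.testing)

If -testing really is a boolean defaulting to true, passing just -testing with no value would flip it off.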

I am currently using 3 GPUs and see the following warning for every training epoch: warning: could not load nccl, falling back to default communication
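That warning appears to come from DataParallelTable when the Torch NCCL bindings fail to load; it then falls back to its default peer-to-peer copies, which is slower but should still work. A quick check from inside th (assuming you intended to install the nccl.torch rock and the NCCL library in the image):

    -- Run inside an interactive `th` session: can the Torch NCCL binding be
    -- loaded at all? If this prints false, DataParallelTable falls back to
    -- its default GPU-to-GPU communication, as the warning says.
    local ok, err = pcall(require, 'nccl')
    print('nccl loaded:', ok)
    if not ok then print(err) end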

Expected behavior.

I expect the model to be tested without error and for training to complete.

Actual behavior.

Error during testing

Represent: 12900/13233
Represent: 13000/13233
Represent: 13100/13233
Represent: 13200/13233
Represent: 13233/13233
Loading embeddings.

Error which kills training:

Epoch: [22][8/200] Time 0.150 tripErr 1.90e-01

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    /root/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
    /root/torch/install/share/lua/5.1/threads/queue.lua:65: in function </root/torch/install/share/lua/5.1/threads/queue.lua:41>
    [C]: in function 'pcall'
    /root/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
    [string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
    /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
    ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:717: in function 'exec'
    ...t/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:195: in function 'forward'
    /root/openface/training/train.lua:162: in function </root/openface/training/train.lua:140>
    [C]: in function 'xpcall'
    /root/torch/install/share/lua/5.1/threads/threads.lua:174: in function 'dojob'
    /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
    /root/openface/training/train.lua:66: in function 'train'
    main.lua:44: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/threads/threads.lua:179: in function 'dojob'
    /root/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
    /root/openface/training/train.lua:66: in function 'train'
    main.lua:44: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
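Since the frames that fail are in DataParallelTable's exec/forward, one way to narrow this down is to exercise the multi-GPU forward path outside of OpenFace. The sketch below is only a rough isolation test; it assumes cutorch/cunn are installed and a few GPUs are visible, and the tiny network is a stand-in, not OpenFace's nn4 model:

    -- Rough isolation test, independent of OpenFace: does a plain
    -- nn.DataParallelTable forward pass work on this machine at all?
    require 'cunn'

    local nGPU = math.min(cutorch.getDeviceCount(), 3)
    local gpus = {}
    for i = 1, nGPU do table.insert(gpus, i) end

    -- tiny stand-in network (the real model is OpenFace's nn4 definition)
    local net = nn.Sequential()
       :add(nn.SpatialConvolution(3, 8, 3, 3, 1, 1, 1, 1))
       :add(nn.ReLU(true))
       :cuda()

    local dpt = nn.DataParallelTable(1)  -- split along the batch dimension
    dpt:add(net, gpus)
    dpt:cuda()

    local input = torch.CudaTensor(nGPU * 4, 3, 96, 96):uniform()
    local output = dpt:forward(input)
    print(output:size())  -- expect nGPU*4 x 8 x 96 x 96

If that also fails with DataParallelTable on this setup, the problem is more likely in the CUDA/driver/cunn stack than in the OpenFace training code.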

Steps to reproduce.

I've included the docker method to reproduce the error in the interest of clarity; to test the issue without docker, simply skip steps 0-2.

0) Install docker and nvidia-docker

1) Pull the docker image

docker pull mbuckler/open-face-cuda

2) Run the docker image with a volume pointing to where your datasets are stored

nvidia-docker run -v /local/path/to/your/datasets/:/datasets -it mbuckler/open-face-cuda /bin/bash

3) Preprocess the CASIA dataset as shown here https://cmusatyalab.github.io/openface/training-new-models/

4) Preprocess the LFW dataset as shown here https://cmusatyalab.github.io/openface/models-and-accuracies/#running-the-lfw-experiment

5) Make an lfw directory in ~/openface/data

6) Copy pairs.txt to the directory you created (~/openface/data/lfw/).

7) cd to ~/openface/training

8) Delete any caches previously created for CASIA or LFW. If this is the first time you are using the newly preprocessed CASIA and LFW datasets, no caches will exist and you can skip this step.

9) Attempt to run training, telling the script where the preprocessed CASIA and LFW datasets are located via the command-line arguments shown below:

th main.lua -data /datasets/casia/CASIA-WebFace-Aligned -nGPU 3 -nDonkeys 8 -epochSize 200 -peoplePerBatch 10 -imagesPerPerson 10 -lfwDir /datasets/lfw/dlib-affine-sz\:96 -testBatchSize 100

OS and hardware information.

mbuckler commented 7 years ago

After poking around, it seems the CUDA error may be due to an issue with the CUDA or CUDNN version. What versions are you folks running? My dockerfile currently uses CUDA 7.5 and CUDNN 5, but this could easily be changed.
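For what it's worth, the cuDNN version that the Torch bindings actually linked against inside the container can be read from th (the CUDA toolkit version is easiest to read off nvcc --version or the base image tag). This assumes the cudnn.torch and cutorch rocks are installed, which they should be if training runs at all:

    -- Run inside `th`: report the cuDNN version seen by the Torch bindings
    -- and the GPUs visible to cutorch.
    require 'cutorch'
    require 'cudnn'

    print('cudnn version:', cudnn.version)
    for i = 1, cutorch.getDeviceCount() do
       local prop = cutorch.getDeviceProperties(i)
       print(string.format('GPU %d: %s (%.1f GB)', i, prop.name,
                           prop.totalGlobalMem / 2^30))
    end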

Supersak80 commented 7 years ago

@mbuckler did you find out why this is happening? I'm getting the same exact error, and this only happens when training across multiple GPUs. @bamos is there a specific requirement on the version of CUDA/CUDNN?

mbuckler commented 7 years ago

Hi @Supersak80, unfortunately I was never able to fix this error when using multiple GPUs. Instead I moved to one GPU as a workaround, so the good news is that one GPU works, but the bad news is that the error still persists with multiple GPUs.

Supersak80 commented 7 years ago

@mbuckler take a look at this: https://github.com/facebook/fb.resnet.torch/issues/139

Supersak80 commented 7 years ago

@bamos is there a specific requirement on the version of CUDA/CUDNN? Your input is greatly appreciated!