FYI, this fix works for hash tables (tds.Hash replaces a Lua table with almost completely interchangeable syntax):
tds = require 'tds'
local tbl = tds.Hash()
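For reference, a minimal sketch of how tds.Hash mirrors plain-table usage (assuming the tds rock is installed); the practical difference is that its entries live in C, outside the luajit heap:

local tds = require 'tds'

-- plain Lua table (counts against the luajit heap):
local t = {}
t['alice'] = 1

-- tds.Hash (allocated in C, outside the luajit heap):
local h = tds.Hash()
h['alice'] = 1
for k, v in pairs(h) do print(k, v) end
print(#h)  -- number of entries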
I'm guessing the problem is in this part of training/dataset.lua. It can easily surpass the luajit memory limit for large datasets, and I also think it's unnecessarily slow (I believe tableFind makes this O(N^2)).
-- find class names
self.classes = {}
local classPaths = {}
if self.forceClasses then
   for k,v in pairs(self.forceClasses) do
      self.classes[k] = v
      classPaths[k] = {}
   end
end

local function tableFind(t, o)
   for k,v in pairs(t) do
      if v == o then return k end
   end
end

-- loop over each paths folder, get list of unique class names,
-- also store the directory paths per class
-- for each class,
for _,path in ipairs(self.paths) do
   local dirs = dir.getdirectories(path);
   for _,dirpath in ipairs(dirs) do
      local class = paths.basename(dirpath)
      local idx = tableFind(self.classes, class)
      if not idx then
         table.insert(self.classes, class)
         idx = #self.classes
         classPaths[idx] = {}
      end
      if not tableFind(classPaths[idx], dirpath) then
         table.insert(classPaths[idx], dirpath);
      end
   end
end
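For illustration, the same class-discovery loop with a reverse lookup table, so each membership check is O(1) instead of a linear scan. This is a sketch of the idea, not the exact patch: classToIdx is an added helper, the forceClasses handling is omitted for brevity, and classPaths[idx] is used as a set here, so downstream code that iterates it as an array would need adjusting.

local tds = require 'tds'

self.classes = tds.Hash()       -- idx -> class name
local classPaths = tds.Hash()   -- idx -> set of directory paths
local classToIdx = tds.Hash()   -- class name -> idx (replaces tableFind)
local nClasses = 0

for _, path in ipairs(self.paths) do
   for _, dirpath in ipairs(dir.getdirectories(path)) do
      local class = paths.basename(dirpath)
      local idx = classToIdx[class]
      if not idx then
         nClasses = nClasses + 1
         idx = nClasses
         classToIdx[class] = idx
         self.classes[idx] = class
         classPaths[idx] = tds.Hash()
      end
      classPaths[idx][dirpath] = true   -- constant-time duplicate check
   end
end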
I made some changes (mostly using the tds hashmap above). I'll comment back if this fixes the issue.
Interesting, thanks! This seems reasonable and I'll take a PR fixing it. I've never had access to a dataset large enough to have this problem. :-)
Also, if you're hacking around in the data-loading code, you might be able to speed up loading by removing the extra dataset load mentioned in #117.
-Brandon.
Yeah it seems that each "donkey" does the entire loading process.
So with N classes, M photos per class, and K donkeys, loading is roughly O(KN^2) in time and at least O(KNM) in memory. That memory crashes luajit for large N and M because of the native object limit, and loading the dataset takes over a day.
The memory limit can be avoided by switching entirely to tensors and tds.Hash tables, since neither is a native luajit object.
I'll play around with making it faster. The file listing only needs to happen once. I also don't see the need to write everything to disk and load it back just to stream it to another file on disk; disk I/O is very expensive, especially with the overhead of one file per identity in the current code. I believe the file lists can fit in memory. Writing to disk may still be useful for resuming training, but that should only require one write.
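A sketch of the "scan once, write once" idea: do the directory walk a single time, serialize the result, and let every donkey just torch.load the cache. The cache path and the buildClassList/buildClassPaths helpers are illustrative names, not code from dataset.lua:

local cachePath = paths.concat(opt.data, 'pathCache.t7')   -- illustrative location

local function loadOrBuildCache()
   if paths.filep(cachePath) then
      return torch.load(cachePath)        -- each donkey reuses the single scan
   end
   local cache = {
      classes    = buildClassList(),      -- illustrative: the tds-based scan above
      classPaths = buildClassPaths(),
   }
   torch.save(cachePath, cache)           -- the one write, also usable for resuming
   return cache
end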
I just ran into the same issue over the last two days. My dataset has 7,500 people and 250,000 faces. With a GPU, I run out of memory after a couple of hours. The only output I see:
Argument parsing and import libraries took 52.7680699825 seconds.
Loading the dlib and OpenFace models took 28.5321800709 seconds.
Loading embeddings.
Training for 7782 classes.
I am very interested in solving this problem. I've noticed that with smaller datasets (e.g. 500 people with 10,000 faces), training the classifier goes fairly quickly and without problems.
This is probably not the same issue. My issue is with loading the dataset paths (e.g. in dataset.lua); training commences fine. If you are running out of memory during training, you might need to change the batch parameters.
I attached my modified dataset.lua. It cuts loading time from 26 hours to 5 minutes for me and removes the memory issues I had. I'll send a pull request once I test it some more.
@neuraldomi - let's continue this discussion on another thread if you want (please copy/paste these messages if you start one), but I never intended for the classification demo (which doesn't use dataset.lua) to be used at that scale. For better performance you should add minibatch support to the Python interface, add some garbage collection calls to openface_server.lua, add cudnn support to openface_server.lua, and investigate better Python-based classifiers for this scale (potentially neural-net-based). I'd be happy to take PRs for any of these.
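On the garbage-collection point, a minimal sketch of periodic collection inside a request loop; net, readRequest, and writeResponse are placeholders, and the loop structure is illustrative rather than the actual openface_server.lua code:

local nRequests = 0
while true do
   local img = readRequest()      -- hypothetical: however the server receives an image
   local rep = net:forward(img)
   writeResponse(rep)             -- hypothetical: however the server returns the embedding

   -- Lua's collector can lag behind the tensors allocated per request,
   -- so force a full collection every so often.
   nRequests = nRequests + 1
   if nRequests % 100 == 0 then
      collectgarbage()
   end
end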
@aaronnech - awesome! Great improvement. Let me know how testing goes.
Testing seems fine so far, I think. I'm running my modified code now. The average loss is sitting around 0.2 and decreasing only slightly as the hours roll by (roughly 0.205 to 0.201); I'm on epoch 20 now. Is this normal, or is something up?
Is an "epoch" in your implementation a pass over the entire dataset? It seems to randomly sample up to some arbitrary N. If I have many identities, would it constantly be drawing identities it hasn't seen yet (maybe explaining the slow-moving loss above)? Should there instead be a counter so that an "epoch" covers the entire dataset?
Are there any restrictions on just copying one of my intermediate models (e.g. model_15.t7) and benchmarking it on LFW?
Thanks!
Great, the loss doesn't go down at the beginning since we don't include easily satisfied triplets, but you should see the LFW accuracy going up. Also, an 'epoch' in the training code isn't actually a pass over the dataset; it's just the number of iterations after which the model is tested and saved. You can probably define a pseudo-epoch based on the random sampling and triplet selection if you're interested.
-Brandon.
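If you want the pseudo-epoch idea above, a rough way to define it, assuming the sampler draws peoplePerBatch identities with imagesPerPerson images each per iteration (the option and variable names here are illustrative):

-- Images sampled per training iteration (assumed sampler behaviour).
local imagesPerIter = opt.peoplePerBatch * opt.imagesPerPerson

-- Iterations needed, in expectation, to draw as many samples as the dataset holds.
local itersPerPseudoEpoch = math.ceil(nTotalImages / imagesPerIter)
print(('one pseudo-epoch is roughly %d iterations'):format(itersPerPseudoEpoch))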
Strange, my LFW score is still at 50% when I follow your procedure here: https://cmusatyalab.github.io/openface/models-and-accuracies/
I do the following:
Is this correct? The dataset I'm training on has many identities but only ~6-10 photos per identity, and it has been aligned. 50% is random guessing, so I'm sure there's an error somewhere. It isn't throwing any errors; it's simply happily training away. The output looks like this:
Epoch: [103][279/500] Time 0.738 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (258, 258)
Epoch: [103][280/500] Time 0.721 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (65, 65)
Epoch: [103][281/500] Time 0.474 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (282, 282)
Epoch: [103][282/500] Time 0.741 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (76, 76)
Epoch: [103][283/500] Time 0.480 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (259, 259)
Epoch: [103][284/500] Time 0.720 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (317, 317)
Epoch: [103][285/500] Time 1.490 tripErr 2.00e-01
Any ideas?
Sure, that looks correct. Also, test.lua should be automatically testing on LFW after every 'epoch'.
Also I wasn't sure if you were going to send in a PR so I just committed your changes. I've started training a new model on my dataset to make sure your changes here are working well.
Awesome. I made a comment on your commit. I think I'm a version or two behind (from before the automatic LFW testing scheme was implemented). I'll try updating and training again.
Thanks for all the information!
Also what model are you using? For your scale of data, nn4 or nn2 instead of OpenFace's small variants might work better. But I'd still expect you to do better than 50% LFW accuracy.
Also, I noticed you're using 4 GPUs, but the current OpenFace code only uses 1 GPU. @melgor and I have been thinking about adding multi-GPU support in issue #106 if you're interested in contributing some code for faster training; neither of us is actively working on it.
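For anyone picking up the multi-GPU item, a minimal sketch of the usual Torch pattern, wrapping the model in nn.DataParallelTable. This is the generic cunn approach, not code from the OpenFace training scripts:

require 'cunn'

local function makeDataParallel(model, nGPU)
   if nGPU <= 1 then return model:cuda() end
   -- Split each minibatch along dimension 1 and run a replica on each GPU.
   local dpt = nn.DataParallelTable(1)
   for i = 1, nGPU do
      cutorch.setDevice(i)
      dpt:add(model:clone():cuda(), i)
   end
   cutorch.setDevice(1)
   return dpt
end

model = makeDataParallel(model, 4)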
I see. Yeah, multiple GPUs could definitely be utilized. On the original dataset.lua change: were you able to get reasonable accuracies? I'm training on LFW to rule out an issue with loading my data.
Something is still wrong. A fresh install gives 50% accuracy on LFW after 10 epochs of training on... LFW.
Any ideas?
I got an error while re-training that I think is caused by this update. For now, I've reverted dataset.lua and am re-training with the old code just to confirm the update was the cause.
/home/bamos/torch/install/bin/luajit: /home/bamos/torch/install/share/lua/5.1/threads/threads.lua:264:
[thread 2 callback] ...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:333: magick.Image: error loading image: Unable to open file (/home/bamos/openface/data/casia-facescrub/aligned/FlorianHenckelvonDonnersmarck/13.png/home/bamos/openface/data/casia-facescrub/aligned/LaurenHolly/15.png) (ExceptionType=430)
stack traceback:
[C]: in function 'error'
...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:333: in function 'magick_error'
...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:400: in function 'load'
/home/bamos/openface/training/donkey.lua:34: in function 'sampleHookTrain'
/home/bamos/openface/training/dataset.lua:395: in function 'samplePeople'
/home/bamos/openface/training/train.lua:59: in function </home/bamos/openface/training/train.lua:58>
[C]: in function 'xpcall'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:231: in function 'callback'
/home/bamos/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/bamos/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
/home/bamos/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
[C]: in function 'error'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/bamos/openface/training/train.lua:56: in function 'train'
./main.lua:44: in main chunk
[C]: in function 'dofile'
...amos/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
training(master)$ cat work/002/test.log
lfwAcc
5.1880e-01
5.4630e-01
5.2730e-01
Moved your dataset changes to dataset-issue-132.lua. Can you check out the latest version of the code, commit further changes here, and send in a PR? I've also added TDS to the Dockerfile so that when we move this back to dataset.lua, the Travis build should pass.
Also, training's working well with the old dataset code on my dataset:
training(master)$ cat work/001/test.log
lfwAcc
7.0320e-01
7.5700e-01
7.7550e-01
7.6080e-01
8.1830e-01
8.1150e-01
8.1480e-01
7.8850e-01
8.2980e-01
8.3920e-01
8.5170e-01
8.7100e-01
Is Python 3 a requirement for lfw.py? I'm running Python 2 and wondering if that's the cause of the strange accuracy. I'm now attempting to train on LFW (which should, in theory, get excellent results when testing on LFW) with the original dataset code.
I think the error you got above is a tensor layout problem caused by the size of the entries in the path tensor (maxLength), since it concatenates two adjacent paths when pulling one string out. I think adding 1 was incorrect; I now believe you should add 2, to accommodate a null terminating character.
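To illustrate the suspected failure mode, a sketch of packing paths into rows of a CharTensor and reading them back with ffi.string; rowWidth, longestPathLength, nImages, somePath, and the index i are illustrative names, not the variables in dataset.lua:

local ffi = require 'ffi'

-- Packing: each row must hold the string AND its '\0';
-- ffi.copy(dst, str) writes both.
local rowWidth = longestPathLength + 2      -- the "+2" discussed above
local imagePaths = torch.CharTensor(nImages, rowWidth):zero()
ffi.copy(imagePaths[i]:data(), somePath)

-- Unpacking: ffi.string scans until it hits a zero byte. If rowWidth is too
-- small, row i ends up without a terminator of its own, the scan runs into
-- row i+1, and the result is "path_i .. path_{i+1}", which is exactly the
-- concatenated path in the GraphicsMagick error above.
local p = ffi.string(imagePaths[i]:data())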
I run the LFW script with Python 3, but I don't think it would give wrong accuracies if you run it with Python 2.
I'm running it with Python 2 and everything works ok.
Interesting. Well it does seem this is a separate (very critical) issue. I'll play around with my setup some more and open a new issue.
Did adding 2 instead of 1 to maxPathLength fix the concatenated path error you experienced?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Context of the issue.
I have a big dataset and can't get past the loading part of training. From searching, I think there is a table somewhere (possibly holding the directories of images/folders) that is probably a native Lua table and thus subject to the luajit object limit. I'm looking into the issue myself now, but I thought I'd post here in case someone knows where this might be, so I can replace it with a C hash table.
Expected behavior.
It should load a dataset of any size.
Actual behavior.
When the dataset becomes sufficiently large, there is a luajit memory PANIC error similar to the one described here: http://stackoverflow.com/questions/27015150/how-to-get-past-1gb-memory-limit-of-64-bit-luajit-on-linux
OS and hardware information.
64 GB RAM, 4 NVIDIA Titans, Linux.