FYI, this fix works for hash tables (tds.Hash replaces a Lua table with almost completely interchangeable syntax):
tds = require 'tds'
local tbl = tds.Hash()
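For reference, a minimal sketch of how tds.Hash mirrors plain-table usage (assuming the tds rock is installed); the practical difference is that its entries live in C, outside the luajit heap:

local tds = require 'tds'

-- plain Lua table (counts against the luajit heap):
local t = {}
t['alice'] = 1

-- tds.Hash (allocated in C, outside the luajit heap):
local h = tds.Hash()
h['alice'] = 1
for k, v in pairs(h) do print(k, v) end
print(#h)  -- number of entries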
I'm guessing the problem is in this part of training/dataset.lua. It can easily surpass the luajit memory limit for large datasets, and I also think it's unnecessarily slow (I believe tableFind makes this O(N^2)).
-- find class names
self.classes = {}
local classPaths = {}
if self.forceClasses then
   for k,v in pairs(self.forceClasses) do
      self.classes[k] = v
      classPaths[k] = {}
   end
end

local function tableFind(t, o)
   for k,v in pairs(t) do
      if v == o then return k end
   end
end

-- loop over each paths folder, get list of unique class names,
-- also store the directory paths per class
-- for each class,
for _,path in ipairs(self.paths) do
   local dirs = dir.getdirectories(path);
   for _,dirpath in ipairs(dirs) do
      local class = paths.basename(dirpath)
      local idx = tableFind(self.classes, class)
      if not idx then
         table.insert(self.classes, class)
         idx = #self.classes
         classPaths[idx] = {}
      end
      if not tableFind(classPaths[idx], dirpath) then
         table.insert(classPaths[idx], dirpath);
      end
   end
end
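For illustration, the same class-discovery loop with a reverse lookup table, so each membership check is O(1) instead of a linear scan. This is a sketch of the idea, not the exact patch: classToIdx is an added helper, the forceClasses handling is omitted for brevity, and classPaths[idx] is used as a set here, so downstream code that iterates it as an array would need adjusting.

local tds = require 'tds'

self.classes = tds.Hash()       -- idx -> class name
local classPaths = tds.Hash()   -- idx -> set of directory paths
local classToIdx = tds.Hash()   -- class name -> idx (replaces tableFind)
local nClasses = 0

for _, path in ipairs(self.paths) do
   for _, dirpath in ipairs(dir.getdirectories(path)) do
      local class = paths.basename(dirpath)
      local idx = classToIdx[class]
      if not idx then
         nClasses = nClasses + 1
         idx = nClasses
         classToIdx[class] = idx
         self.classes[idx] = class
         classPaths[idx] = tds.Hash()
      end
      classPaths[idx][dirpath] = true   -- constant-time duplicate check
   end
end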
I made some changes (mostly using the tds hashmap above). I'll comment back if this fixes the issue.
Interesting, thanks! This seems reasonable and I'll take a PR fixing it. I've never had access to a dataset large enough to have this problem. :-)
Also, if you're hacking around in the data-loading code, you might be able to speed up loading by removing the extra dataset load mentioned in #117.
-Brandon.
Yeah it seems that each "donkey" does the entire loading process.
So with N classes, M photos per class, and K donkeys, loading is roughly O(KN^2) in time and at least O(KNM) in memory. That memory crashes luajit for large N and M because of the native object limit, and loading the dataset takes over a day.
The memory limit can be avoided by switching entirely to tensors and tds.Hash tables, since neither is a native luajit object.
I'll play around with making it faster. The file listing only needs to happen once. I also don't see the need to write everything to disk and load it back just to stream it to another file on disk; disk I/O is very expensive, especially with the overhead of one file per identity in the current code. I believe the file lists can fit in memory. Writing to disk may still be useful for resuming training, but that should only require one write.
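A sketch of the "scan once, write once" idea: do the directory walk a single time, serialize the result, and let every donkey just torch.load the cache. The cache path and the buildClassList/buildClassPaths helpers are illustrative names, not code from dataset.lua:

local cachePath = paths.concat(opt.data, 'pathCache.t7')   -- illustrative location

local function loadOrBuildCache()
   if paths.filep(cachePath) then
      return torch.load(cachePath)        -- each donkey reuses the single scan
   end
   local cache = {
      classes    = buildClassList(),      -- illustrative: the tds-based scan above
      classPaths = buildClassPaths(),
   }
   torch.save(cachePath, cache)           -- the one write, also usable for resuming
   return cache
end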
I just ran into the same issue over the last two days. My dataset has 7,500 people and 250,000 faces. With a GPU, I run out of memory after a couple of hours. The only output I see:
Argument parsing and import libraries took 52.7680699825 seconds.
Loading the dlib and OpenFace models took 28.5321800709 seconds.
Loading embeddings.
Training for 7782 classes.
I am very interested in solving this problem. I've noticed that with smaller datasets (e.g. 500 people with 10,000 faces), training the classifier goes fairly quickly and without problems.
This is probably not the same issue. My issue is with loading the dataset paths (e.g. in dataset.lua); training commences fine. If you are running out of memory during training, you might need to change the batch parameters.
I attached my modified dataset.lua. It cuts loading time from 26 hours to 5 minutes for me and removes the memory issues I had. I'll send a pull request once I test it some more.
@neuraldomi - let's continue this discussion on another thread if you want (please copy/paste these messages if you start one), but I never intended for the classification demo (which doesn't use dataset.lua) to be used at that scale. For better performance you should add minibatch support to the Python interface, add some garbage collection calls to openface_server.lua, add cudnn support to openface_server.lua, and investigate better Python-based classifiers for this scale (potentially neural-net-based). I'd be happy to take PRs for any of these.
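On the garbage-collection point, a minimal sketch of periodic collection inside a request loop; net, readRequest, and writeResponse are placeholders, and the loop structure is illustrative rather than the actual openface_server.lua code:

local nRequests = 0
while true do
   local img = readRequest()      -- hypothetical: however the server receives an image
   local rep = net:forward(img)
   writeResponse(rep)             -- hypothetical: however the server returns the embedding

   -- Lua's collector can lag behind the tensors allocated per request,
   -- so force a full collection every so often.
   nRequests = nRequests + 1
   if nRequests % 100 == 0 then
      collectgarbage()
   end
end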
@aaronnech - awesome! Great improvement. Let me know how testing goes.
Testing seems fine so far, I think. I'm running my modified code now. The average loss is sitting around 0.2 and decreasing only slightly as the hours roll by (roughly 0.205 to 0.201); I'm on epoch 20 now. Is this normal, or is something up?
Is an "epoch" in your implementation a pass over the entire dataset? It seems to randomly sample up to some arbitrary N. If I have many identities, would it constantly be drawing identities it hasn't seen yet (maybe explaining the slow-moving loss above)? Should there instead be a counter so that an "epoch" covers the entire dataset?
Are there any restrictions on just copying one of my intermediate models (e.g. model_15.t7) and benchmarking it on LFW?
Thanks!
Great, the loss doesn't go down at the beginning since we don't include easily satisfied triplets, but you should see the LFW accuracy going up. Also, an 'epoch' in the training code isn't actually a pass over the dataset; it's just the number of iterations after which the model is tested and saved. You can probably define a pseudo-epoch based on the random sampling and triplet selection if you're interested.
-Brandon.
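If you want the pseudo-epoch idea above, a rough way to define it, assuming the sampler draws peoplePerBatch identities with imagesPerPerson images each per iteration (the option and variable names here are illustrative):

-- Images sampled per training iteration (assumed sampler behaviour).
local imagesPerIter = opt.peoplePerBatch * opt.imagesPerPerson

-- Iterations needed, in expectation, to draw as many samples as the dataset holds.
local itersPerPseudoEpoch = math.ceil(nTotalImages / imagesPerIter)
print(('one pseudo-epoch is roughly %d iterations'):format(itersPerPseudoEpoch))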
Strange, my LFW score is still at 50% when I follow your procedure here: https://cmusatyalab.github.io/openface/models-and-accuracies/
I do the following:
Is this correct? The dataset I'm training on has many identities but only ~6-10 photos per identity, and it has been aligned. 50% is random guessing, so I'm sure there's an error somewhere. It isn't throwing any errors; it's simply happily training away. The output looks like this:
Epoch: [103][279/500] Time 0.738 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (258, 258)
Epoch: [103][280/500] Time 0.721 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (65, 65)
Epoch: [103][281/500] Time 0.474 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (282, 282)
Epoch: [103][282/500] Time 0.741 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (76, 76)
Epoch: [103][283/500] Time 0.480 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (259, 259)
Epoch: [103][284/500] Time 0.720 tripErr 2.00e-01
+ (nTrips, nTripsFound) = (317, 317)
Epoch: [103][285/500] Time 1.490 tripErr 2.00e-01
Any ideas?
Sure, that looks correct. Also, test.lua should be automatically testing on LFW after every 'epoch'.
Also I wasn't sure if you were going to send in a PR so I just committed your changes. I've started training a new model on my dataset to make sure your changes here are working well.
Awesome. I made a comment on your commit. I think I'm a version or two behind (from before the automatic LFW testing scheme was implemented). I'll try updating and training again.
Thanks for all the information!
Also what model are you using? For your scale of data, nn4 or nn2 instead of OpenFace's small variants might work better. But I'd still expect you to do better than 50% LFW accuracy.
Also, I noticed you're using 4 GPUs, but the current OpenFace code only uses 1 GPU. @melgor and I have been thinking about adding multi-GPU support in issue #106 if you're interested in contributing some code for faster training; neither of us is actively working on it.
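For anyone picking up the multi-GPU item, a minimal sketch of the usual Torch pattern, wrapping the model in nn.DataParallelTable. This is the generic cunn approach, not code from the OpenFace training scripts:

require 'cunn'

local function makeDataParallel(model, nGPU)
   if nGPU <= 1 then return model:cuda() end
   -- Split each minibatch along dimension 1 and run a replica on each GPU.
   local dpt = nn.DataParallelTable(1)
   for i = 1, nGPU do
      cutorch.setDevice(i)
      dpt:add(model:clone():cuda(), i)
   end
   cutorch.setDevice(1)
   return dpt
end

model = makeDataParallel(model, 4)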
I see. Yeah, multiple GPUs could definitely be utilized. On the original dataset.lua change: were you able to get reasonable accuracies? I'm training on LFW to rule out an issue with loading my data.
Something is still wrong. A fresh install gives 50% accuracy on LFW after 10 epochs of training on... LFW.
Any ideas?
I got an error while re-training that I think is caused by this update. For now, I've reverted dataset.lua and am re-training with the old code just to confirm the update was the cause.
/home/bamos/torch/install/bin/luajit: /home/bamos/torch/install/share/lua/5.1/threads/threads.lua:264:
[thread 2 callback] ...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:333: magick.Image: error loading image: Unable to open file (/home/bamos/openface/data/casia-facescrub/aligned/FlorianHenckelvonDonnersmarck/13.png/home/bamos/openface/data/casia-facescrub/aligned/LaurenHolly/15.png) (ExceptionType=430)
stack traceback:
[C]: in function 'error'
...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:333: in function 'magick_error'
...mos/torch/install/share/lua/5.1/graphicsmagick/Image.lua:400: in function 'load'
/home/bamos/openface/training/donkey.lua:34: in function 'sampleHookTrain'
/home/bamos/openface/training/dataset.lua:395: in function 'samplePeople'
/home/bamos/openface/training/train.lua:59: in function </home/bamos/openface/training/train.lua:58>
[C]: in function 'xpcall'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:231: in function 'callback'
/home/bamos/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/bamos/torch/install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
/home/bamos/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
[C]: in function 'error'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
/home/bamos/torch/install/share/lua/5.1/threads/threads.lua:198: in function 'addjob'
/home/bamos/openface/training/train.lua:56: in function 'train'
./main.lua:44: in main chunk
[C]: in function 'dofile'
...amos/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
training(master)$ cat work/002/test.log
lfwAcc
5.1880e-01
5.4630e-01
5.2730e-01
Moved your dataset changes to dataset-issue-132.lua. Can you check out the latest version of the code, commit further changes here, and send in a PR? I've also added TDS to the Dockerfile so that when we move this back to dataset.lua, the Travis build should pass.
Also, training's working well with the old dataset code on my dataset:
training(master)$ cat work/001/test.log
lfwAcc
7.0320e-01
7.5700e-01
7.7550e-01
7.6080e-01
8.1830e-01
8.1150e-01
8.1480e-01
7.8850e-01
8.2980e-01
8.3920e-01
8.5170e-01
8.7100e-01
Is Python 3 a requirement for lfw.py? I'm running Python 2 and wondering if that's the cause of the strange accuracy. I'm now attempting to train on LFW (which should, in theory, get excellent results when testing on LFW) with the original dataset code.
I think the error you got above is a tensor layout problem caused by the size of the entries in the path tensor (maxLength), since it concatenates two adjacent paths when pulling one string out. I think adding 1 was incorrect; I now believe you should add 2, to accommodate a null terminating character.
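To illustrate the suspected failure mode, a sketch of packing paths into rows of a CharTensor and reading them back with ffi.string; rowWidth, longestPathLength, nImages, somePath, and the index i are illustrative names, not the variables in dataset.lua:

local ffi = require 'ffi'

-- Packing: each row must hold the string AND its '\0';
-- ffi.copy(dst, str) writes both.
local rowWidth = longestPathLength + 2      -- the "+2" discussed above
local imagePaths = torch.CharTensor(nImages, rowWidth):zero()
ffi.copy(imagePaths[i]:data(), somePath)

-- Unpacking: ffi.string scans until it hits a zero byte. If rowWidth is too
-- small, row i ends up without a terminator of its own, the scan runs into
-- row i+1, and the result is "path_i .. path_{i+1}", which is exactly the
-- concatenated path in the GraphicsMagick error above.
local p = ffi.string(imagePaths[i]:data())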
I run the LFW script with Python 3, but I don't think it would give wrong accuracies if you run it with Python 2.
I'm running it with Python 2 and everything works ok.
Interesting. Well it does seem this is a separate (very critical) issue. I'll play around with my setup some more and open a new issue.
Did adding 2 instead of 1 to maxPathLength fix the concatenated path error you experienced?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Context of the issue.
I have a big dataset and can't get past the loading part of training. From searching, I think there is a table somewhere (possibly holding the directories of images/folders) that is probably a native Lua table and thus subject to the luajit object limit. I'm looking into the issue myself now, but I thought I'd post here in case someone knows where this might be, so I can replace it with a C hash table.
Expected behavior.
It should load a dataset of any size.
Actual behavior.
When the dataset becomes sufficiently large, there is a luajit memory PANIC error similar to the one described here: http://stackoverflow.com/questions/27015150/how-to-get-past-1gb-memory-limit-of-64-bit-luajit-on-linux
OS and hardware information.
64 GB RAM, 4 NVIDIA Titans, Linux.