estaudt opened this issue 8 years ago
Try making nDonkeys smaller; it may help.
I can reproduce this and will try to fix it. In the meantime, you can train on train instead of trainval; I think that should not have this problem.
Update: Changing trainval to train and nDonkeys from 6 to 4 worked.
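For anyone else hitting this: assuming the scripts read these options from environment variables the same way train_nGPU is passed in the commands in this thread (I haven't verified that nDonkeys can be set from the environment rather than edited inside the script), the working configuration corresponds to roughly:
train_nGPU=1 test_nGPU=1 nDonkeys=4 ./scripts/train_coco.sh
plus switching the proposal set from trainval to train inside train_coco.sh.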
I changed trainval to train in train_coco.sh and ran into the following error.
Loading proposals at
{
1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
}
Done loading proposals
nImages 82783
/home/elliot/torch/install/bin/luajit: ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 6 callback] not enough memory
stack traceback:
[C]: in function 'error'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
...are/lua/5.1/torchnet/dataset/paralleldatasetiterator.lua:85: in function '__init'
/home/elliot/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/elliot/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'getIterator'
train.lua:122: in main chunk
[C]: in function 'dofile'
...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
However, when I then changed nDonkeys from 6 to 4, training commenced. I'm not actually sure what nDonkeys stands for. Regardless, thanks for the tips @szagoruyko and @northeastsquare.
@estaudt reducing nDonkeys (the number of parallel data-loading threads) turns off the integral loss and increases data loading time.
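Each donkey is one worker thread in the torchnet iterator that train.lua constructs (it appears in the traceback above as paralleldatasetiterator.lua), and every thread builds its own copy of the dataset loader, so memory use scales with nDonkeys. A minimal sketch of that pattern, with the multipathnet-specific loading replaced by a stand-in dataset:

local tnt = require 'torchnet'

local iterator = tnt.ParallelDatasetIterator{
   nthread = 4, -- this is what nDonkeys controls
   init = function() require 'torchnet' end, -- runs once per worker thread
   closure = function()
      -- each thread constructs its own dataset; in multipathnet this is
      -- where proposals and annotations get loaded, hence the per-thread
      -- memory cost
      return tnt.ListDataset{
         list = torch.range(1, 100):long(),
         load = function(idx)
            return { input = torch.randn(3, 224, 224), target = idx }
         end,
      }
   end,
}

for sample in iterator() do
   -- the training loop consumes sample.input / sample.target here
end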
Getting the same error when executing:
train_nGPU=1 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
model_conv345_norm : true
model_foveal_exclude : -1
model_het : true
}
/home/vijay/torch/install/bin/luajit: /home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
[C]: in function 'error'
/home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
...
Any fix?
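The index out of range comes from models/multipathnet.lua:32, which adapts the pretrained classifier by removing three of its modules. nn.Sequential:remove(i) raises exactly this error when i exceeds the number of modules, so the classifier that was loaded has fewer than 9 modules; that usually points to a wrong or incompletely downloaded pretrained .t7 file rather than a bug in the script. A toy illustration of what the loop does, assuming a 9-module classifier:

require 'nn'

-- stand-in for the pretrained classifier (9 modules)
local classifier = nn.Sequential()
for i = 1, 9 do classifier:add(nn.Linear(10, 10)) end

-- remove the highest indices first so the remaining ones stay valid;
-- this errors with "index out of range" if the net has fewer modules
-- than the largest index
for _, v in ipairs{9, 8, 1} do classifier:remove(v) end
print(#classifier.modules) -- 6

So it is worth checking which pretrained model train.lua is loading and whether that file matches what the script expects.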
I commented out the line in models/multipathnet.lua:
-- for i,v in ipairs{9,8,1} do classifier:remove(v) end
Doing that results in the following :(
{
1 : CudaTensor - size: 4x3x224x224
2 : CudaTensor - empty
}
...
/home/demo/torch/install/bin/luajit: ./modules/ModelParallelTable.lua:357: ModelParallelTable only supports CudaTensor, not torch.FloatTensor
stack traceback:
[C]: in function 'error'
./modules/ModelParallelTable.lua:357: in function 'type'
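Note that the traceback ends in function 'type': something is trying to cast the ModelParallelTable to torch.FloatTensor, and the container refuses by design, since it is GPU-only and everything it touches has to stay a CudaTensor. A generic sketch of that rule, using a plain nn module because ModelParallelTable's constructor is internal to this repo:

require 'cunn'

local net = nn.Linear(10, 10):cuda() -- GPU-resident module
local input = torch.FloatTensor(4, 10)

-- net:forward(input) -- would fail: FloatTensor fed into a CUDA net
local output = net:forward(input:cuda()) -- cast inputs to CudaTensor first

So commenting out the remove loop seems to treat a symptom: the model later hits a float cast that the container rejects.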
As another update, when I reduced nDonkeys, training seemed to run, but spit out NaNs for loss and 0 for everything else.
I'm trying to train on the COCO dataset and I'm running into the following errors. When attempting to train with train_multipathnet_coco.sh, I see this.
train_nGPU=2 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
model_conv345_norm : true
model_foveal_exclude : -1
model_het : true
}
/home/elliot/torch/install/bin/luajit: /home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
[C]: in function 'error'
/home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
/home/elliot/Devel/multipathnet/models/multipathnet.lua:32: in main chunk
[C]: in function 'dofile'
train.lua:104: in main chunk
[C]: in function 'dofile'
...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
When I attempt to train with train_coco.sh, I see this.
train_nGPU=1 test_nGPU=1 ./scripts/train_coco.sh
...
Loading proposals at
{
1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
2 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/val.t7"
}
Done loading proposals
proposal images 123287
dataset images 118287
images 123287
nImages 118287
PANIC: unprotected error in call to Lua API (not enough memory)
Changing train_nGPU=1 to train_nGPU=2 yields the same output but with a different error:
FATAL THREAD PANIC: (pcall) not enough memory
FATAL THREAD PANIC: (write) not enough memory
I'm running on Ubuntu 14.04 LTS with two Titan X GPUs and 64GB of RAM. Any ideas?
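For anyone landing here later: PANIC: unprotected error in call to Lua API (not enough memory) is usually LuaJIT hitting its own heap limit (on the order of 1-2 GB on x64, independent of system RAM), not the machine running out of memory, which is why 64GB of RAM does not help and why fewer donkeys does. Besides reducing nDonkeys, the standard Torch workaround is to rebuild with plain Lua 5.2, which has no such limit (paths assume the default Torch install location):

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh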