facebookresearch / multipathnet

A Torch implementation of the object detection network from "A MultiPath Network for Object Detection" (https://arxiv.org/abs/1604.02135)

FATAL THREAD PANIC - while training coco #11

Open · estaudt opened this issue 8 years ago

estaudt commented 8 years ago

I'm trying to train on the COCO dataset and I run into the following errors. When attempting to train with train_multipathnet_coco.sh, I see this:

```
train_nGPU=2 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
  model_conv345_norm : true
  model_foveal_exclude : -1
  model_het : true
}
/home/elliot/torch/install/bin/luajit: /home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
  [C]: in function 'error'
  /home/elliot/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
  /home/elliot/Devel/multipathnet/models/multipathnet.lua:32: in main chunk
  [C]: in function 'dofile'
  train.lua:104: in main chunk
  [C]: in function 'dofile'
  ...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
```
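(Editorial note: the traceback points at `classifier:remove(v)` in models/multipathnet.lua:32, which, per a comment later in this thread, loops over `ipairs{9,8,1}`. A minimal sketch of how `nn.Sequential:remove` raises this error when the loaded network has fewer modules than expected; the toy layer counts are illustrative, not the actual multipathnet model:)

```lua
require 'nn'

-- A toy classifier with only 2 modules. multipathnet removes modules
-- {9, 8, 1} from a pretrained classifier, so if the loaded .t7 model
-- has fewer than 9 modules, remove(9) fails exactly as in the log.
local classifier = nn.Sequential()
classifier:add(nn.Linear(10, 10))
classifier:add(nn.ReLU())

local ok, err = pcall(function() classifier:remove(9) end)
print(ok, err)  -- false   .../nn/Sequential.lua:29: index out of range
```

This would suggest the pretrained model file being loaded does not have the layer layout the script expects.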

When I attempt to train with train_coco.sh, I see this:

```
train_nGPU=1 test_nGPU=1 ./scripts/train_coco.sh
...
Loading proposals at {
  1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
  2 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/val.t7"
}
Done loading proposals
proposal images 123287
dataset images 118287
images 123287
nImages 118287
PANIC: unprotected error in call to Lua API (not enough memory)
```

Changing train_nGPU=1 to train_nGPU=2 yields the same output but with a different error:

```
FATAL THREAD PANIC: (pcall) not enough memory
FATAL THREAD PANIC: (write) not enough memory
```

I'm running on Ubuntu 14.04 LTS with two Titan X GPUs and 64GB of RAM. Any ideas?

northeastsquare commented 8 years ago

Try making nDonkeys smaller; that may help.

szagoruyko commented 8 years ago

I can reproduce this and will try to fix it. In the meantime you can train on train instead of trainval; I think that shouldn't have this problem.

estaudt commented 8 years ago

Update: Changing trainval to train and nDonkeys from 6 to 4 worked.

I changed trainval to train in train_coco.sh and ran into the following error.

```
Loading proposals at {
  1 : "/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7"
}
Done loading proposals
proposal images 82783
dataset images 82783
images 82783
nImages 82783
/home/elliot/torch/install/bin/luajit: ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 6 callback] not enough memory
stack traceback:
  [C]: in function 'error'
  ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
  ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
  ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
  ...e/elliot/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
  ...are/lua/5.1/torchnet/dataset/paralleldatasetiterator.lua:85: in function '__init'
  /home/elliot/torch/install/share/lua/5.1/torch/init.lua:91: in function </home/elliot/torch/install/share/lua/5.1/torch/init.lua:87>
  [C]: in function 'getIterator'
  train.lua:122: in main chunk
  [C]: in function 'dofile'
  ...liot/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
```

However, when I then changed nDonkeys from 6 to 4, training commenced. I'm not actually sure what nDonkeys stands for. Regardless, thanks for the tips @szagoruyko and @northeastsquare.
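(Editorial note: these "not enough memory" panics come from the Lua heap, not from GPU or system RAM. Torch built against LuaJIT inherits LuaJIT's hard per-state heap limit of roughly 1-2 GB, and each donkey thread is its own Lua state that deserializes the large proposal .t7 files. A debugging sketch for watching the Lua heap while loading; the path comes from the log above and this is not multipathnet code:)

```lua
require 'torch'

-- Deserialize the SharpMask proposals and report Lua-heap usage.
-- collectgarbage('count') returns kilobytes managed by the Lua GC,
-- which is the allocation pool that LuaJIT caps.
local proposals = torch.load(
   '/home/elliot/Devel/multipathnet/data/proposals/coco/sharpmask/train.t7')
print(('Lua heap after load: %.1f MB'):format(collectgarbage('count') / 1024))
```

A workaround often suggested in the Torch community is rebuilding the distro with plain Lua 5.2 (`TORCH_LUA_VERSION=LUA52 ./install.sh`), which has no such cap; reducing nDonkeys, as done here, also helps because every thread loads its own copy of the proposals.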

szagoruyko commented 8 years ago

@estaudt reducing nDonkeys turns off integral loss and increases data loading time
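(Editorial note: "donkeys" is Torch slang for data-loading worker threads, popularized by the imagenet-multiGPU.torch example; multipathnet's data iterator spins up nDonkeys of them via the threads package, as the traceback above shows. A minimal sketch of the pattern; the setup body is assumed, not copied from the repo:)

```lua
local threads = require 'threads'

-- Each donkey is a worker thread with its own Lua state and its own
-- copy of the dataset/proposals, which is why host-memory use scales
-- with nDonkeys.
local nDonkeys = 4  -- the value that worked for estaudt

local donkeys = threads.Threads(
   nDonkeys,
   function(threadid)
      require 'torch'
      -- a real loader would torch.load() the proposals here
      print(('donkey %d ready'):format(threadid))
   end
)
donkeys:synchronize()
donkeys:terminate()
```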

veejai commented 8 years ago

Getting the same error when executing:

```
train_nGPU=1 test_nGPU=1 ./scripts/train_multipathnet_coco.sh
...
model_opt
{
  model_conv345_norm : true
  model_foveal_exclude : -1
  model_het : true
}
/home/vijay/torch/install/bin/luajit: /home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: index out of range
stack traceback:
  [C]: in function 'error'
  /home/vijay/torch/install/share/lua/5.1/nn/Sequential.lua:29: in function 'remove'
  ...
```

Any fix?

northeastsquare commented 8 years ago

I commented out this line in models/multipathnet.lua:

```lua
-- for i,v in ipairs{9,8,1} do classifier:remove(v) end
```

veejai commented 8 years ago

doing that results in the following :(

```
{
  1 : CudaTensor - size: 4x3x224x224
  2 : CudaTensor - empty
}
...
/home/demo/torch/install/bin/luajit: ./modules/ModelParallelTable.lua:357: ModelParallelTable only supports CudaTensor, not torch.FloatTensor
stack traceback:
  [C]: in function 'error'
  ./modules/ModelParallelTable.lua:357: in function 'type'
```
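(Editorial note: this failure means part of the network still holds torch.FloatTensor parameters when multipathnet's ModelParallelTable wrapper type-checks it. The standard Torch fix is to cast the module to CUDA before handing it to any CUDA-only container; a generic sketch, since the exact construction site in train.lua isn't shown in this thread:)

```lua
require 'cunn'

-- :cuda() converts every parameter and buffer of an nn module from
-- torch.FloatTensor/DoubleTensor to torch.CudaTensor.
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 16, 3, 3))
net:cuda()  -- skipping this cast is what triggers
            -- "only supports CudaTensor, not torch.FloatTensor"

print(torch.type(net:get(1).weight))  -- torch.CudaTensor
```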

estaudt commented 8 years ago

As another update, when I reduced nDonkeys, training seemed to run, but spit out NaNs for loss and 0 for everything else.
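(Editorial note: for NaN losses like this, a cheap first diagnostic is a NaN guard right after the criterion's forward pass, exploiting the fact that NaN is the only value not equal to itself. A self-contained sketch; `criterion`, `output`, and `target` are hypothetical stand-ins for whatever train.lua actually names them:)

```lua
require 'nn'

-- Guard pattern: test the scalar loss for NaN immediately after the
-- forward pass, before it silently propagates into running averages.
local criterion = nn.MSECriterion()
local output = torch.Tensor{0/0}  -- deliberately NaN to trip the guard
local target = torch.Tensor{1}

local loss = criterion:forward(output, target)
if loss ~= loss then  -- true only for NaN
   print('loss is NaN: check learning rate, inputs, and proposal targets')
end
```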