DmitryUlyanov / texture_nets

Code for "Texture Networks: Feed-forward Synthesis of Textures and Stylized Images" paper.
Apache License 2.0

Training terminates after 1300 iterations #62

Closed · ShushanArakelyan closed 7 years ago

ShushanArakelyan commented 7 years ago

Hi,

For some weird reason, training with the johnson model terminates for me after 1300 iterations with the message "Killed". I am using the "default" command line provided in the readme for the johnson model, with different style images (I tried the candy style and some custom doodle styles), and it is always killed after 1300 iterations. Any ideas on why this happens or how I can deal with it?

Thanks

DmitryUlyanov commented 7 years ago

Hello, I also sometimes experience strange bugs like that (not only in texture_nets, but with torch in general), but have no idea what causes them. Other people hit it too: https://github.com/facebook/fb.resnet.torch/issues/140

In this case I bet it is something with threading; it seems unstable in torch, and both the dataloader and nn.DataParallelTable use threading.

DmitryUlyanov commented 7 years ago

Did you try updating your torch install?

ShushanArakelyan commented 7 years ago

My torch install is really fresh (I just got it to try texture_nets out)

DmitryUlyanov commented 7 years ago

Did you install it with

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh

?

ShushanArakelyan commented 7 years ago

Yep, I followed the instructions from torch.ch and the installation seemed successful. I was able to train with fast_neural_style successfully, if that sheds any light on the issue. I am currently trying to train with the pyramid model; not sure if it makes any difference, but I will let you know if it also terminates after exactly 1300 iterations (I find that fact quite surprising).

DmitryUlyanov commented 7 years ago

Try this

luarocks install torch
luarocks install sys
luarocks install cutorch
luarocks install loadcaffe
luarocks install nn
luarocks install cunn
luarocks install optim
luarocks install image
luarocks install cudnn

Maybe it will help. Also make sure you have enough disk space.

ShushanArakelyan commented 7 years ago

So, I tried using a smaller batch_size, which surprisingly works fine. Am I right in assuming it is related to insufficient memory?

michaelhuang74 commented 7 years ago

Memory might be the issue, but not always. Sometimes, with batch_size = 1 on a Tesla K40, the training still terminates prematurely without outputting any error message.

ShushanArakelyan commented 7 years ago

Hm, well, now with batch_size = 2 I get an error message that I am running out of RAM, even though I am using an amazon p2.xlarge instance with 61GB RAM and 12GB GPU memory, which I assumed would be enough :/ Is there any point in trying to profile the code?

michaelhuang74 commented 7 years ago

You can use nvidia-smi to check how much GPU memory is used while the training runs. If the amount used is close to the maximum, you are likely running out of GPU memory.
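
If spot checks with nvidia-smi are too coarse, you can also log GPU memory from inside the training loop. A minimal sketch using cutorch's getMemoryUsage, assuming cutorch is already required (it is when training on GPU); the iteration counter name is illustrative, use whatever loop variable train.lua has:

-- log used/total GPU memory every 100 iterations (values come back in bytes)
if iteration % 100 == 0 then
   local freeMem, totalMem = cutorch.getMemoryUsage(cutorch.getDevice())
   print(string.format('iter %d: GPU memory used %.0f/%.0f MB',
         iteration, (totalMem - freeMem) / 2^20, totalMem / 2^20))
end

If the used amount grows steadily over iterations instead of staying flat, that points to a leak rather than a batch size that is simply too large.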

ShushanArakelyan commented 7 years ago

I tried that a couple of times while training but could never spot anything weird. If I'm not mistaken, it seemed to use at most 4GB of GPU memory with batch_size = 4 at random points during the training.

alexguo commented 7 years ago

Adding this line right after https://github.com/DmitryUlyanov/texture_nets/blob/master/train.lua#L198 should fix this leak:

net:clearState()
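
net:clearState() drops each module's cached output and gradInput buffers (they are rebuilt on the next forward pass), so they are neither held onto across iterations nor serialized into the periodic checkpoints. A rough sketch of the placement, with illustrative names (params.save_every, params.model_name) rather than the exact train.lua code:

if iteration % params.save_every == 0 then
   net:clearState()   -- free cached module buffers before saving
   torch.save(params.model_name .. '.t7', net)
end
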
michaelhuang74 commented 7 years ago

@alexguo Thanks. I tested it and it worked.

DmitryUlyanov commented 7 years ago

Fixed (https://github.com/DmitryUlyanov/texture_nets/commit/a29853a0072c16289c590173c030a56333249f68), thanks!