Hello, I also sometimes experience strange bugs like that (not only in texture_nets, but generally with torch), but have no idea what causes them. Other people do too: https://github.com/facebook/fb.resnet.torch/issues/140
In this case I bet it is something with threading; it seems unstable in torch, and both the dataloader and nn.DataParallelTable use threads.
Did you try to update your torch install?
My torch install is really fresh (I just got it to try texture_nets out)
Did you install it with
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
?
Yep, I followed the instructions from torch.ch and the installation seemed successful. I was able to train with fast_neural_style successfully though, if that sheds any light on the issue. I am currently trying to train with the pyramid model; not sure if it makes any difference, but I will let you know if it also terminates after exactly 1300 iterations (I find that exact number quite surprising).
Try this
luarocks install torch
luarocks install sys
luarocks install cutorch
luarocks install loadcaffe
luarocks install nn
luarocks install cunn
luarocks install optim
luarocks install image
luarocks install cudnn
Maybe it will help. Also make sure you have enough disk space.
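If you want to sanity-check the reinstall, here is a minimal sketch that just tries to load each of the packages above (run it with th; the file name check_install.lua is made up for illustration):
-- check_install.lua: try to require each package that texture_nets uses
local packages = {'torch', 'sys', 'cutorch', 'loadcaffe', 'nn', 'cunn', 'optim', 'image', 'cudnn'}
for _, name in ipairs(packages) do
  local ok, err = pcall(require, name)
  print(string.format('%-10s %s', name, ok and 'OK' or ('FAILED: ' .. tostring(err))))
end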
So, I tried using a smaller batch_size, which surprisingly works fine. Am I right in assuming it is related to insufficient memory?
Memory might be the issue, but it is not always the one. Sometimes, even with batch_size = 1 on a Tesla K40, the training terminates prematurely without outputting any error message.
Hm, well, now with batch_size = 2 I get an error message saying I am running out of RAM, even though I am using an Amazon p2.xlarge instance with 61GB RAM and 12GB GPU memory, which I assumed would be enough :/ Is there any point in trying to profile the code or anything?
You can use nvidia-smi to check how much GPU memory is used while the training runs. If the amount of used memory is close to the maximum, you are likely running out of GPU memory.
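If watching nvidia-smi by hand is awkward, you can also print the numbers from inside torch; a minimal sketch using cutorch.getMemoryUsage (where you drop this into the training loop is up to you, it is not part of train.lua):
-- print used / total memory on the current GPU, in MB
require 'cutorch'
local freeBytes, totalBytes = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU memory: %.0f / %.0f MB used',
    (totalBytes - freeBytes) / (1024 * 1024), totalBytes / (1024 * 1024)))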
I tried that a couple of times while training but could never spot anything weird. If I'm not mistaken, it only seemed to use up to 4GB of GPU memory with batch_size=4 at random points during the training.
Adding this line below https://github.com/DmitryUlyanov/texture_nets/blob/master/train.lua#L198 should fix the leak:
net:clearState()
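For context, net:clearState() drops the cached outputs and gradInputs that every module keeps around, so the network that gets serialized at each checkpoint stays small and memory stops accumulating. A rough sketch of where the call would sit (the names iteration and save_every are illustrative, not taken from train.lua):
-- hypothetical checkpointing step; only the clearState() call is the actual fix
if iteration % save_every == 0 then
  net:clearState()  -- free cached module buffers before serializing
  torch.save('model_' .. iteration .. '.t7', net)
end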
@alexguo Thanks. I tested it and it worked.
Hi,
For some weird reason, training with the johnson model terminates for me after 1300 iterations with the message "Killed". I am using the "default" command line as provided in the readme file for the johnson model with different style images (I tried the candy style and some custom doodle styles, and it is always killed after 1300 iterations). Any ideas on why this happens or how I can deal with it?
Thanks