daniel-kukiela / nmt-chatbot

NMT Chatbot

Killed on 1st epoch #91

Closed - sierikov closed this issue 6 years ago

sierikov commented 6 years ago

Hi. I tried to install CUDA and the toolkit, but there was a problem with the GPG key of the .deb package, so I could not run sudo apt-get update.

My GPU (NVIDIA 940MX) supports CUDA (compute capability > 3.0), but I could not install the driver from the NVIDIA website on Ubuntu 16.04, so I installed it through Software Update > Additional drivers.

The runfile installer also fails on Ubuntu 16.04, so I decided to switch from GPU to CPU. I went through all the steps and prepared the data for training, but the system kills the training process after:

# Start step 0, lr 0.001, Sun Sep  2 10:50:36 2018
# Init train iterator, skipping 0 elements
Killed

Also, I've tried to limit CPU usage for this process with the help of my daemon and the cpulimit tool, but the process was still killed by the system. What can I do to get training running?

PS: Size of the training data: train.to - 2.2 GB, train.from - 2.2 GB.

daniel-kukiela commented 6 years ago

Hi. Killed usually means you are running out of memory - the OS kills processes to survive. What you can do is keep more memory free before starting training, for example by adding a swap file/partition. The other options are the same as for running out of memory on a GPU - lower batch_size in the settings or reduce the model size. How much RAM do you have?
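Lowering batch_size here means editing the hparams dict in the project's settings file. A minimal sketch of that kind of change, assuming a settings file shaped like setup/settings.py with keys such as batch_size, num_units and num_layers (names and defaults may differ in your copy):

# Excerpt of a settings file such as setup/settings.py (illustrative values only).
# A smaller batch_size lowers the memory used per training step; smaller
# num_units / num_layers shrink the model itself.
hparams = {
    'num_units': 256,    # e.g. reduced from 512
    'num_layers': 2,
    'batch_size': 64,    # e.g. reduced from 128
    # ...remaining hyperparameters left at their defaults...
}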

sierikov commented 6 years ago

I've tried it on a server (4 GB RAM + 3 CPUs) and then once more on my PC (8 GB RAM). I'm using TensorFlow 1.4.0. How can I control memory usage during training, and how much do I need?

daniel-kukiela commented 6 years ago

The default params are tuned for about 4 GB of memory - did you change them? The model should easily fit into 8 GB. You can't really know in advance how much memory you will need; you run the model and see if it fits. For tips on limiting the amount of memory required, see above.
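One practical way to "run a model and see if it fits" is to watch memory from a second terminal while training warms up. A minimal sketch using the third-party psutil package (pip install psutil); pass it the PID of the training process, e.g. the one ps reports for train.py:

import sys
import time

import psutil

# Usage: python watch_mem.py <pid-of-training-process>
pid = int(sys.argv[1])
proc = psutil.Process(pid)

while proc.is_running():
    rss_gb = proc.memory_info().rss / float(1024 ** 3)                # memory held by the process
    avail_gb = psutil.virtual_memory().available / float(1024 ** 3)   # RAM still free system-wide
    print('process RSS: %.2f GB | available RAM: %.2f GB' % (rss_gb, avail_gb))
    time.sleep(5)

If available RAM heads toward zero (and swap fills up) before the first training step finishes, the plain Killed above is almost certainly the kernel's OOM killer.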

sierikov commented 6 years ago

Settings are default. I didn't touch them.

daniel-kukiela commented 6 years ago

So make sure you have enough free memory (4 GB is not enough with the default settings).

sierikov commented 6 years ago

I will now try enabling swap and also resize my server to 8 GB of RAM - that should help.

sierikov commented 6 years ago

I have enabled 4 GB of swap. It doesn't help. (screenshot attached: 2018-09-02 13-37-38)
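If it still dies with a bare Killed, the kernel log will confirm whether the OOM killer was responsible and show how much memory the process held when it was killed. A minimal sketch that filters the relevant lines out of dmesg (run it with sudo if dmesg output is restricted on your system):

import subprocess

# Read the kernel ring buffer and keep only OOM-killer entries; they name the
# killed process and report its memory usage at the time of the kill.
log = subprocess.check_output(['dmesg']).decode('utf-8', 'replace')
for line in log.splitlines():
    if 'Out of memory' in line or 'oom-killer' in line:
        print(line)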

daniel-kukiela commented 6 years ago

To be honest, I'm not sure how much additional RAM it will need. We were talking about the memory needed by TensorFlow itself, but there are a couple of other things that need memory as well, for example loading and processing the training data.

sierikov commented 6 years ago

Hooray! (screenshots attached: DigitalOcean server CPU usage and memory usage, 2018-09-08)

I have resized my server to 6 vCPUs and 16 GB of RAM. Training takes 66% of memory and 567% of CPU, but it runs. I will also test it with a larger swap file (30 GB, for example); maybe that will work.

Thanks for the help!