Closed by sierikov 6 years ago
Hi. `Killed` usually means you are running out of memory: the OS kills processes to survive.
What you can do is keep more memory free before starting training, for example by adding a swap file or partition. The other options are the same as for running out of memory on a GPU: lower `batch_size` in the settings or limit the model size.
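To confirm that the kernel's OOM killer is really what ended the process, you can look for its message in the kernel log. Below is a minimal sketch, assuming the log text comes from something like `dmesg` output; the exact wording differs between kernel versions, so the pattern is a best-effort guess, and `find_oom_kills` is a name made up for this example.

```python
import re

# Older kernels log "Out of memory: Kill process <pid> (<name>) ...",
# newer ones "Out of memory: Killed process <pid> (<name>) ...".
# The pattern accepts both; other wordings may exist and would be missed.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for OOM kills found in the text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(kernel_log_text)]
```

If the training process shows up in that list, the advice above (more free memory, swap, smaller batch or model) applies; if not, the kill probably had another cause.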
How much RAM do you have?
I've tried on the server (4 GB RAM + 3 CPUs) and then one more time on my PC (8 GB RAM). I'm also using TensorFlow 1.4.0. How can I control memory usage for training, and how much do I need?
Default params are tuned for about 4GB of memory, did you change them? The model should easily fit into 8GB. You can't really know in advance how much memory you need; you can run a model and see if it fits. For tips about limiting the required amount of memory, look above.
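The "run it and see if it fits" approach can be automated. This is only a sketch under assumptions: `build_command` is a hypothetical callable that turns a batch size into the command line of your training script (this project's actual entry point and settings are not shown here), and the starting value of 128 is arbitrary. It relies on the fact that `subprocess` reports a child killed by signal N with return code -N, so a SIGKILL (what the OOM killer sends) appears as -9.

```python
import signal
import subprocess
import sys


def find_fitting_batch_size(build_command, start=128, minimum=1):
    """Halve the batch size until the child process is no longer SIGKILLed.

    build_command: hypothetical callable, batch size -> argv list.
    Returns (batch_size, returncode), or (None, -9) if nothing fit.
    """
    batch_size = start
    while batch_size >= minimum:
        result = subprocess.run(build_command(batch_size))
        # SIGKILL is a hint of an OOM kill, not proof: anything else
        # (success or an ordinary error) stops the search.
        if result.returncode != -signal.SIGKILL:
            return batch_size, result.returncode
        batch_size //= 2
    return None, -signal.SIGKILL
```

In practice you would stop each successful run after a few training steps rather than let it finish, since memory use may still grow later in the run.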
Settings are default. I didn't touch them.
So make sure you have enough free memory (4GB is not enough when using default settings).
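To check how much memory is actually free before starting, on Linux you can read `/proc/meminfo`. A small sketch follows; the helper names are invented for this example, and it assumes the usual `Key:   value kB` layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(meminfo_text):
    """Memory available to new processes, in GiB."""
    info = parse_meminfo(meminfo_text)
    # MemAvailable is missing on old kernels; MemFree is a cruder fallback.
    kib = info.get("MemAvailable", info.get("MemFree", 0))
    return kib / (1024 ** 2)


def read_available_gib(path="/proc/meminfo"):
    """Read the live value from the running system (Linux only)."""
    with open(path) as f:
        return available_gib(f.read())
```

Calling `read_available_gib()` right before training gives a rough idea of the headroom; swap is reported separately in the same file under `SwapFree`.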
I will now try enabling swap and also resizing my server to 8 GB of RAM; it should help.
I have enabled 4 GB of swap. It doesn't help.
To be honest, I'm not sure how much additional RAM it will need. We were talking about the amount of memory needed by TensorFlow, but there are a couple of other things that need memory, for example loading and processing the training data.
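As an illustration of why data handling matters: reading two 2.2GB text files fully into Python lists costs several GB on its own, while iterating over them lazily keeps only one batch in memory. The sketch below is not how this project loads its data (that code is not shown here); `iter_pair_batches` is a made-up helper, and it assumes the two files are line-aligned.

```python
from itertools import islice


def iter_pair_batches(from_path, to_path, batch_size):
    """Yield lists of (source, target) line pairs, batch_size pairs at a time.

    Only the current batch is held in memory; the files are read lazily.
    """
    with open(from_path, encoding="utf-8") as src, \
            open(to_path, encoding="utf-8") as tgt:
        pairs = zip(src, tgt)  # stops at the shorter file
        while True:
            batch = [(a.rstrip("\n"), b.rstrip("\n"))
                     for a, b in islice(pairs, batch_size)]
            if not batch:
                return
            yield batch
```

For example, `for batch in iter_pair_batches("train.from", "train.to", 64): ...` walks both files without ever holding more than 64 pairs at once.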
Hooray!
I have resized my server to 6 vCPUs and 16 GB of RAM. It takes 66% of memory and 567% of CPU, but it runs. I will test it with a larger swap file (30 GB, for example); maybe it will work.
Thanks for the help!
Hi. I have tried to install CUDA and the toolkit, but there was a problem with the gpg key on the `.deb` package, so I cannot do `sudo apt-get update`.
My GPU, an `NVIDIA 940MX`, supports CUDA compute capability > 3.0, but I cannot install the driver from the NVIDIA website on `Ubuntu 16.04`. I installed this driver through `Software Update > Additional drivers`. The runfile installer also fails on `Ubuntu 16.04`.
So I decided to switch from GPU to CPU. I did all the steps and prepared the data for training, but the system kills this process after
Also, I've tried to limit CPU usage for this process with the help of my `daemon` and the `cpulimit` tool, but the process was also killed by the system. What can I do to run training?
PS: Size of train data: `train.to` - 2.2GB and `train.from` - 2.2GB