Closed iuria21 closed 5 years ago
Can you first update to the latest version with
pip install -U git+https://github.com/autonomio/talos@daily-dev
and see if the problem persists?
How big is your dataset?
The problem persists with the latest version. The dataset is very small, I'm just testing: it's only 2,000 sentences, padded to a length of 75.
Are you on the TensorFlow backend?
Also, I'm not sure what you mean by "running out of CPU memory", as it seems to me that you are running a GPU model.
Related, what is this parallel_model = multi_gpu_model(model, gpus=2)?
Yes, I'm on the TensorFlow backend. The parallel_model was only there to run the model on two GPUs, but it was just an experiment; the problem was there before I tried it. And, as you say, I'm running a GPU model, but I don't know why it's using all of the CPU memory. To explain: I can see that it's running on the GPU (with nvidia-smi), with no OOM errors there, but if I watch the CPU side with htop, the memory fills up iteration by iteration until it's full, and the machine gets stuck because of the lack of RAM.
Here are some pictures: the GPU, and the CPU at iteration 160/432 and again at iteration 195/432.
I get this warning at the beginning, maybe it could be related:
[EDITED] >> removed the trace as it was unrelated
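The growth seen in htop can also be logged from inside the Python process, which makes it easier to tie each jump to a specific iteration. The following is only a minimal sketch, assuming a Linux machine (it reads /proc/self/statm); rss_mb, run_with_memory_log, and train_one are hypothetical names for illustration and are not part of Talos or Keras.

```python
import os


def rss_mb():
    """Resident memory of the current process in MB, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1024 ** 2


def run_with_memory_log(train_one, param_sets):
    """Call train_one(params) for each parameter set.

    train_one is a hypothetical callable that builds and fits one model.
    Returns a list of (iteration, rss_in_mb) tuples.
    """
    log = []
    total = len(param_sets)
    for i, params in enumerate(param_sets, start=1):
        train_one(params)
        log.append((i, rss_mb()))
        print(f"iteration {i}/{total}: RSS {log[-1][1]:.1f} MB")
    return log
```

If the logged value climbs steadily across iterations while a single training run stays flat, that points at something being retained between runs rather than at the size of any one model.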
Thanks. Please share your Scan() command as well.
Sure, this is the Scan command:
import tensorflow as tf
import talos as ta
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
sess = tf.Session(config=config)
set_session(sess)  # set this TensorFlow session as the default session for Keras

t = ta.Scan(x=data,
            y=labels,
            model=breast_cancer_model,
            # grid_downsample=0.01,
            params=p,
            dataset_name='breast_cancer',
            experiment_no='2',
            val_split=0.2)
(I'm not doing the downsample because I want to try all the options)
Any idea what the reason could be? :(
EDIT:
I tried an old version with another dataset, where it worked before I upgraded my GPUs and switched to CUDA 10 and TensorFlow 1.12. Could it be related to this?
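On running without grid_downsample: the number of iterations in a full scan is the product of the number of options for each parameter, which is why the count grows quickly. The sketch below only illustrates that arithmetic; p_example is a hypothetical dictionary (the actual p from this thread is not shown), every entry is assumed to be a plain list, and count_permutations and expand_grid are illustrative helpers, not Talos functions.

```python
import itertools
from math import prod


def count_permutations(p):
    """Number of combinations in a dict that maps parameter names to lists."""
    return prod(len(values) for values in p.values())


def expand_grid(p):
    """Yield one dict per combination of parameter values."""
    keys = list(p)
    for combo in itertools.product(*(p[k] for k in keys)):
        yield dict(zip(keys, combo))


# Hypothetical parameter dictionary, for illustration only.
p_example = {
    "lr": [0.01, 0.001, 0.0001],
    "batch_size": [16, 32],
    "dropout": [0.0, 0.5],
}

print(count_permutations(p_example))  # 3 * 2 * 2 = 12
```

Each yielded dict corresponds to one model that gets built and trained, so memory that is not released after a run is multiplied by this count.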
Definitely yes, the CUDA version is a common cause of these "weird" problems in TensorFlow. Better go back to the old version and see what happens. I'm closing this as it has nothing to do with Talos, but feel free to keep the conversation going here, and do post an update on what happens when you roll back the CUDA version.
Hi, I'm trying to Scan a simple model with a CNN and an LSTM, with a parameter set that is not very large:
with
And the CPU memory usage keeps growing until it uses 64 GB (all of my RAM). I tried training the model with each of the parameters separately, and no more than 5 GB was used. Why is this? Thanks