autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License
1.62k stars 268 forks

Running out of CPU memory #238

Closed iuria21 closed 5 years ago

iuria21 commented 5 years ago

Hi, I'm trying to Scan a simple model with a CNN and an LSTM, with parameters that are not very large:

def breast_cancer_model(x_train, y_train, x_val, y_val, params):
    model = Sequential()
    model.add(Embedding(len(w2idx), params['emb_size'], input_length=MAX_DOCUMENT_LENGTH))
    model.add(Conv1D(params['first_neuron'],params['cn_shape'],padding='same', activation=relu))#original with 64
    model.add(Dropout(params['do1']))
    model.add((LSTM(params['second_neuron'],return_sequences=True)))#original with 100
    model.add(Attention_tf())
    model.add(Dropout(params['do1']))
    model.add(Dense(len(int_idx2la), activation='softmax'))
    parallel_model = multi_gpu_model(model, gpus=2)

    parallel_model.compile(optimizer=params['optimizer']
                  (lr=lr_normalizer(params['lr'],
                                    params['optimizer'])),
                  loss=params['losses'],
                  metrics=[fmeasure_acc])
    #print(model.summary())
    out = parallel_model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=params['batch_size'], verbose=1, epochs=params['epochs'])
    return out, parallel_model

with

p = {'lr': (0.5, 5, 4),
     'second_neuron':[512,256,64],
     'first_neuron': [512,256,64],
     'cn_shape':[4,5],
     'batch_size': [50, 100, 150],
     'epochs': [1],
     'do1': [0.5],
     #'do2': (0.0, 1.0, 4),
     'emb_size': [10000, 1000],
     'optimizer': [Adam],
     'losses': [binary_crossentropy],
     'activation':[relu],
     'last_activation': [softmax]}

The CPU memory usage keeps growing until it reaches 64 GB (all of my RAM). I tried training the model with each of the parameter combinations separately, and no more than 5 GB was used. Why is this? Thanks

mikkokotila commented 5 years ago

Can you first update to the latest version with pip install -U git+https://github.com/autonomio/talos@daily-dev and see if the problem persists?

How big is your dataset?

iuria21 commented 5 years ago

The problem persists with the latest version. The dataset is very small, as I'm just testing: only 2,000 sentences padded to a length of 75.

mikkokotila commented 5 years ago

Are you on TensorFlow backend?

mikkokotila commented 5 years ago

Also, I'm not sure what you mean by "running out of CPU memory", as it seems to me that you are running a GPU model.

Related, what is this parallel_model = multi_gpu_model(model, gpus=2)?

iuria21 commented 5 years ago

Yes, I'm on the TensorFlow backend. The parallel_model was only there to run the model on two GPUs, but it was just an experiment; the problem was there before I tried it. And, as you say, I'm running a GPU model, but I don't know why it's using up all of the CPU memory. To explain: I can see that it's running on the GPU (with nvidia-smi), with no OOM errors there, but if I watch the CPU side with htop, the memory fills up iteration by iteration until it's full, and the machine gets stuck for lack of RAM.
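
One way to check the "fills up iteration by iteration" pattern from inside Python, rather than by watching htop, is to record memory after each round. The following is only a minimal sketch using the standard-library tracemalloc module. The names measure_growth, leaky_round and clean_round are hypothetical and not part of Talos or Keras, and the two round functions are stand-ins for a real training round. Note that tracemalloc only sees allocations made through Python's allocator, so memory held by TensorFlow's native code would not show up here; process-level figures such as those in htop remain the reference.

```python
import gc
import tracemalloc


def measure_growth(run_once, n_rounds):
    """Return the traced Python-heap size (bytes) after each call to run_once."""
    tracemalloc.start()
    usage = []
    for i in range(n_rounds):
        run_once(i)
        gc.collect()  # drop unreachable objects before measuring
        current, _peak = tracemalloc.get_traced_memory()
        usage.append(current)
    tracemalloc.stop()
    return usage


# Hypothetical stand-ins for one training round.
leaked = []


def leaky_round(i):
    # Keeps a reference alive across rounds, so memory accumulates.
    leaked.append(bytearray(1_000_000))


def clean_round(i):
    # Releases its buffer before returning, so memory stays flat.
    tmp = bytearray(1_000_000)
    del tmp


leaky_usage = measure_growth(leaky_round, 5)
clean_usage = measure_growth(clean_round, 5)
print("leaky growth:", leaky_usage[-1] - leaky_usage[0])
print("clean growth:", clean_usage[-1] - clean_usage[0])
```

If the per-round figures climb steadily, as in the leaky case, something is being retained between rounds; if they stay flat while htop still shows growth, the memory is more likely held outside the Python heap.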

iuria21 commented 5 years ago

Here are some screenshots. The GPU: [image]

and the CPU: [image]

[image] (this is at iteration 160/432) [image] and this is at iteration 195/432

iuria21 commented 5 years ago

I get this warning at the beginning; maybe it could be related:



[EDITED] >> removed the trace as it was unrelated

mikkokotila commented 5 years ago

Thanks. Please share your Scan() command as well.

iuria21 commented 5 years ago

Sure, this is the Scan command:

import tensorflow as tf
import talos as ta
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
sess = tf.Session(config=config)
set_session(sess)  # set this TensorFlow session as the default session for Keras

t = ta.Scan(x=data,
            y=labels,
            model=breast_cancer_model,
            #grid_downsample=0.01, 
            params=p,
            dataset_name='breast_cancer',
            experiment_no='2',
            val_split = 0.2)

(I'm not using grid_downsample because I want to try all the options)
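
For reference, the size of the full grid can be worked out from the parameter dictionary. The sketch below is a rough count, not Talos code: count_rounds is a hypothetical helper, the strings stand in for the Keras objects (Adam, binary_crossentropy, relu, softmax), and it assumes that a (min, max, steps) tuple contributes as many values as its third element.

```python
def count_rounds(params):
    """Count grid permutations: lists contribute len(), tuples their step count."""
    total = 1
    for spec in params.values():
        # Assumption: a (min, max, steps) tuple expands to `steps` values.
        total *= spec[2] if isinstance(spec, tuple) else len(spec)
    return total


# Same shape as the dictionary p in the thread; strings replace Keras objects.
p = {'lr': (0.5, 5, 4),
     'second_neuron': [512, 256, 64],
     'first_neuron': [512, 256, 64],
     'cn_shape': [4, 5],
     'batch_size': [50, 100, 150],
     'epochs': [1],
     'do1': [0.5],
     'emb_size': [10000, 1000],
     'optimizer': ['Adam'],
     'losses': ['binary_crossentropy'],
     'activation': ['relu'],
     'last_activation': ['softmax']}

total_rounds = count_rounds(p)
print(total_rounds)  # 4 * 3 * 3 * 2 * 3 * 2 = 432
```

Under that assumption the count is 432, which matches the "160/432" and "195/432" iteration counters mentioned with the screenshots, so the whole scan builds and trains 432 separate models in one process.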

iuria21 commented 5 years ago

Any idea what the reason could be? :(

EDIT:

I tried an old version with another dataset, a setup that worked before I updated my GPUs and changed to CUDA 10 and TensorFlow 1.12. Could it be related to this?

mikkokotila commented 5 years ago

Definitely yes, the CUDA version is a common cause of these "weird" problems in TensorFlow. Better to go back to the old version and see what happens. I'm closing this as it has nothing to do with Talos, but feel free to keep the conversation going here, and do post an update on what happens when you roll back the CUDA version.