autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

ResourceExhaustedError after several iterations in a grid search #482

Closed bjtho08 closed 2 years ago

bjtho08 commented 4 years ago

First off, make sure to check your support options.

The preferred way to resolve usage related matters is through the docs which are maintained up-to-date with the latest version of Talos.

If you do end up asking for support in a new issue, make sure to follow the below steps carefully.

1) Confirm the below

2) Include the output of:

talos.__version__ == 0.6.7

3) Explain clearly what you are trying to achieve

I am running a grid search that gives 36 rounds. After about 4 or 5 rounds, during a model.fit I suddenly get hit by a ResourceExhaustedError. I think this is very odd, given that I am able to complete at least 3 rounds of fitting on the GPU (with a model and batch size that take up pretty much all the GPU memory), so it seems there is a small but significant memory leak somewhere. Any ideas what it could be?
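
A common mitigation when memory accumulates across rounds, noted here as a sketch rather than a confirmed fix for this issue, is to clear the Keras session at the top of the model-building function handed to Scan(). In the sketch below, build_unet, train_generator, and val_generator are placeholders:

import gc

from keras import backend as K


def talos_model(x, y, x_val, y_val, params):
    # Drop the previous round's graph and free GPU memory before
    # building the next model.
    K.clear_session()
    gc.collect()

    model = build_unet(params)          # placeholder model builder
    history = model.fit_generator(
        train_generator,                # placeholder keras.utils.Sequence
        validation_data=val_generator,  # placeholder
        epochs=50,
    )
    return history, model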

bjtho08 commented 4 years ago

My parameter dictionary is:

p = {
    "sigma_noise": [0, 0.01],
    "nb_filters_0": [16, 32, 64],
    "loss_func": ["cat_CE", "tversky_loss", "cat_FL"],
    "arch": ["U-Net"],
    "act": [Swish, ReLU],
}

And I'm running a U-Net with 34 million trainable parameters (for nb_filters_0 == 64), input dimensions of (208, 208, 3), a batch size of 12, and 400 epochs.
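
For reference, that dictionary expands to 2 × 3 × 3 × 1 × 2 = 36 permutations, which matches the 36 rounds; a quick sanity check, assuming p is the dictionary above:

from itertools import product

# Count the grid permutations Talos will run for the dictionary p above.
print(len(list(product(*p.values()))))  # -> 36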

bjtho08 commented 4 years ago

UPDATE: I did a "quick" test where I ran each model for only 50 epochs, and I got a ResourceExhaustedError again in round 4, during the 5th epoch. I think that was actually the exact same spot as before, when each of the 3 previous models had run for 100+ epochs. This tells me that the models are not properly cleared out of GPU memory, and on top of that I might have a memory leak in my generator. @mikkokotila, what do you think?

mikkokotila commented 4 years ago

Very interesting. Can you post your full trace?

bjtho08 commented 4 years ago

Of course! See below. I also included the output leading up to the crash, because I think it gives some idea of how the exception occurs.

[attached screenshot: oom-crash]

---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-10-d1427f7c3b24> in <module>
    104     params=p,
    105     experiment_name="talos/" + date_string,
--> 106     reduction_method='gamify',
    107 )
    108 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
    194         # start runtime
    195         from .scan_run import scan_run
--> 196         scan_run(self)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     24         # otherwise proceed with next permutation
     25         from .scan_round import scan_round
---> 26         self = scan_round(self)
     27         self.pbar.update(1)
     28 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
     17     # fit the model
     18     from ..model.ingest_model import ingest_model
---> 19     self.model_history, self.round_model = ingest_model(self)
     20     self.round_history.append(self.model_history.history)
     21 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
      8                       self.x_val,
      9                       self.y_val,
---> 10                       self.round_params)

~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
    301                 class_weight=class_weights,
    302                 verbose=internal_params["verbose"],
--> 303                 callbacks=model_callbacks + opti_callbacks,
    304             )
    305         return history, model

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1730             use_multiprocessing=use_multiprocessing,
   1731             shuffle=shuffle,
-> 1732             initial_epoch=initial_epoch)
   1733 
   1734     @interfaces.legacy_generator_methods_support

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    218                                             sample_weight=sample_weight,
    219                                             class_weight=class_weight,
--> 220                                             reset_metrics=False)
    221 
    222                 outs = to_list(outs)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1512             ins = x + y + sample_weights
   1513         self._make_train_function()
-> 1514         outputs = self.train_function(ins)
   1515 
   1516         if reset_metrics:

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: OOM when allocating tensor with shape[16,192,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training/Adam/gradients/block1_u_conv1/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
mikkokotila commented 4 years ago

Have you looked at this SO post?

mikkokotila commented 4 years ago

How much memory does your GPU have?

bjtho08 commented 4 years ago

I just checked out your link, and it does not appear to describe the issue I am having, though at first it did look similar. I am running an NVIDIA GeForce GTX 1080 Ti with 11 GB of RAM.

mikkokotila commented 4 years ago

Can you do this:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

...and share the output you get.
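
For context, with the graph-mode Keras + TensorFlow stack shown in the traceback, the hint usually maps to passing RunOptions through compile(); a rough sketch with a stand-in model (on the TensorFlow backend, Keras forwards extra compile() kwargs to Session.run, but this is not verified against every Keras version):

import tensorflow as tf
from keras.layers import Dense
from keras.models import Sequential

# Tiny stand-in model; the point is only the extra compile() kwargs.
model = Sequential([Dense(4, input_shape=(8,))])

run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)
run_meta = tf.compat.v1.RunMetadata()

# With the TensorFlow backend, Keras passes these kwargs on to
# tf.Session.run(), so the OOM error should include a per-tensor
# allocation report.
model.compile(
    optimizer="adam",
    loss="mse",
    options=run_opts,
    run_metadata=run_meta,
)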

bjtho08 commented 4 years ago

I would love to, but that option crashes my Python kernel, so it's not really possible. This is a long-standing Keras bug, I believe.

mikkokotila commented 4 years ago

Yes, it most certainly is an upstream bug in Keras or TensorFlow.

To avoid doubt, can you share your Scan() command?

Also, how about giving Talos 1.0 a shot? It uses a different backend, so you might have better luck.

bjtho08 commented 4 years ago

Sure! I use custom keras.utils.Sequence data generators, so I pass two dummy variables to my Scan() command, as shown below:

dummy_x = np.empty((1, BATCH_SIZE, 208, 208))
dummy_y = np.empty((1, BATCH_SIZE))

scan_object = ta.Scan(
    x=dummy_x,
    y=dummy_y,
    disable_progress_bar=False,
    print_params=True,
    model=talos_model,
    params=p,
    experiment_name="talos/" + date_string,
    reduction_method='gamify',
)

I will take a look at talos 1.0 right away!

bjtho08 commented 4 years ago

So running talos 1.0 had the same outcome, but with a slightly different error message at the end:

 14% |█▌        | 5/36 [1:52:10<11:43:52, 1362.35s/it]
{'act': <class 'keras_contrib.layers.advanced_activations.swish.Swish'>, 'arch': 'U-Net', 'loss_func': 'cat_CE', 'nb_filters_0': 64, 'sigma_noise': 0.01}
tracking <tf.Variable 'block1_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_d_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block5_bottom_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block4_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block3_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block2_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish1/scaling_factor:0' shape=() dtype=float32> scaling_factor
tracking <tf.Variable 'block1_u_Swish2/scaling_factor:0' shape=() dtype=float32> scaling_factor
Training |          | 0% 0/5 [00:00<?, ?it/s]
Epoch 0  |██▌        | [loss: 2.2988, acc: 0.1848, jaccard1_coef: 0.0575] : 25% 113/451 [01:19<03:12, 1.76it/s]
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-10-050e8f7c8199> in <module>
    104         params=p,
    105         experiment_name="talos/" + date_string,
--> 106         reduction_method='gamify',
    107     )
    108 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/Scan.py in __init__(self, x, y, params, model, experiment_name, x_val, y_val, val_split, random_method, seed, performance_target, fraction_limit, round_limit, time_limit, boolean_limit, reduction_method, reduction_interval, reduction_window, reduction_threshold, reduction_metric, minimize_loss, disable_progress_bar, print_params, clear_session, save_weights)
    194         # start runtime
    195         from .scan_run import scan_run
--> 196         scan_run(self)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_run.py in scan_run(self)
     24         # otherwise proceed with next permutation
     25         from .scan_round import scan_round
---> 26         self = scan_round(self)
     27         self.pbar.update(1)
     28 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/scan/scan_round.py in scan_round(self)
     17     # fit the model
     18     from ..model.ingest_model import ingest_model
---> 19     self.model_history, self.round_model = ingest_model(self)
     20     self.round_history.append(self.model_history.history)
     21 

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/talos/model/ingest_model.py in ingest_model(self)
      8                       self.x_val,
      9                       self.y_val,
---> 10                       self.round_params)

~/myapps/mmciad/src/mmciad/utils/hyper.py in talos_model(x, y, val_x, val_y, talos_params)
    303                 class_weight=class_weights,
    304                 verbose=internal_params["verbose"],
--> 305                 callbacks=model_callbacks + opti_callbacks,
    306             )
    307         return history, model

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
     89                 warnings.warn('Update your `' + object_name + '` call to the ' +
     90                               'Keras 2 API: ' + signature, stacklevel=2)
---> 91             return func(*args, **kwargs)
     92         wrapper._original_function = func
     93         return wrapper

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1730             use_multiprocessing=use_multiprocessing,
   1731             shuffle=shuffle,
-> 1732             initial_epoch=initial_epoch)
   1733 
   1734     @interfaces.legacy_generator_methods_support

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
    218                                             sample_weight=sample_weight,
    219                                             class_weight=class_weight,
--> 220                                             reset_metrics=False)
    221 
    222                 outs = to_list(outs)

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1512             ins = x + y + sample_weights
   1513         self._make_train_function()
-> 1514         outputs = self.train_function(ins)
   1515 
   1516         if reset_metrics:

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

~/.pyenv/versions/miniconda3-4.3.30/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics/acc/Identity/_1095]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[16,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node GaussianNoise_preout/cond/add-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
mikkokotila commented 4 years ago

Can you run the input model in a loop a few times and see if you get the same result? If yes, I suggest posting this directly to TensorFlow.

bjtho08 commented 4 years ago

Do you mean a simple loop like this:

# Build the model once, then compile and fit it repeatedly to mimic
# several Talos rounds without Talos in the loop.
model = create_model(*args, **kwargs)

for _ in range(6):
    model.compile(**kwargs)
    model.fit(train_generator, epochs=10)

Or should I try to add some sort of garbage collection to this?

mikkokotila commented 4 years ago

No, just the simplest possible loop.

bjtho08 commented 4 years ago

Is the above code example simple enough, or can it be even simpler?

EDIT: BTW, can you elaborate a bit on what was changed in Talos 1.0? I tried upgrading to tf-2.2.0rc3 because it fixed a memory leak in the fit method related to the Keras Sequence class.

bjtho08 commented 4 years ago

So far, when running a loop like the one I wrote earlier, I am not getting any ResourceExhaustedError. I have almost completed 5 iterations of the loop with 50 epochs per iteration. With Talos, it crashed at the beginning of the fifth iteration.

bjtho08 commented 4 years ago

Okay, so the loop was set to do ten training sessions of 50 epochs, since I knew that 50 epochs was enough to get the ResourceExhaustedError after 5 iterations in the Talos Scan(). It has now completed all 10 passes of the loop without any errors whatsoever. I assume this rules out a TensorFlow bug?

bjtho08 commented 4 years ago

For good measure, I redid the Scan() just to confirm that updating some of the packages did not alter the outcome. Rather than getting a ResourceExhaustedError, my kernel crashed completely (though that may still be due to a ResourceExhaustedError). Any ideas on how to proceed?

bjtho08 commented 4 years ago

I have now tested it on a different machine with a larger GPU (an NVIDIA Quadro RTX 6000 with 24 GB of RAM) and the same thing happens.

bjtho08 commented 4 years ago

To summarize

The bug(?) appears on two systems with the below configuration(s):

It does not appear to happen if a model is compiled and fitted several times in a simple loop, which seems to eliminate the possibility of this being a TensorFlow problem.

mikkokotila commented 4 years ago

Is it possible for you to share a self-contained Jupyter notebook or Colab, so I can just run it and reproduce?

Also, is create_model identical in both cases?

bjtho08 commented 4 years ago

Yes, create_model is identical. I will try to see if I can make a self-contained notebook. Currently, my solution has been to modify Talos to accept a new boolean parameter allow_resume, which, if True, saves the ParamSpace, the list of keys/metrics, and the various stores to files on disk; in the event of a crash (or interrupt) it reads these files and restores the important parts of the Scan object before executing scan_run(). It might sacrifice some efficiency, but it sure beats never getting to the finish line ;)

BTW, are there any special considerations behind doing method level imports rather than module/top level imports?

EDIT: If you want, you can have a look at my fork of Talos and see what I changed. I haven't committed the latest additions yet, but the primary stuff is in place.

mikkokotila commented 3 years ago

Sorry, I totally missed this.

BTW, are there any special considerations behind doing method level imports rather than module/top level imports?

Yes. Chunks of code stay self-contained, readability improves, imports only happen when they are actually needed, etc.
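
The pattern is visible in the traceback above; for example, scan_round() only pulls in ingest_model at the point where a round actually runs (paraphrased from the trace):

# talos/scan/scan_round.py (roughly, as it appears in the traceback)
def scan_round(self):
    # fit the model -- the import happens only when a round executes
    from ..model.ingest_model import ingest_model
    self.model_history, self.round_model = ingest_model(self)
    self.round_history.append(self.model_history.history)
    return self  # scan_run() reassigns: self = scan_round(self)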

How about we implement the feature described above in v1.1?

bjtho08 commented 3 years ago

We could do that, but I'm not sure my hack is the best way to go at this point. It makes sense for me, but it could be a lot cleaner, I think. Perhaps storing everything in one file rather than having about three different files to read from :) I will be happy to show you the changes I made, though, and you can decide for yourself what you think of them.

bjtho08 commented 3 years ago

You can look at the changes and additions here: https://github.com/bjtho08/talos/tree/1.0.1-dev

mikkokotila commented 3 years ago

Thanks. Do I understand correctly that the feature is simply to:

Is there anything I'm missing?

bjtho08 commented 3 years ago

Yep, that pretty much sums it up.

In the project directory (where the logging CSV file is stored), three additional files are created: a pickle that contains the various stores from each run; a YAML file that lists the remaining permutations in the parameter space (dumped from self.param_object); and a YAML file containing self._all_keys, self._metric_keys, and self._val_keys.

As I said, it can most likely be done in a cleaner fashion. I just hacked this together over a few days to work around my issue with constant crashes after a few iterations :)
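
For readers hitting the same crash, a minimal sketch of the save/restore pattern described above (not the actual fork; the file names and exactly what gets persisted are illustrative):

import pickle

import yaml

STATE_PKL = "resume_stores.pkl"        # illustrative file names
STATE_YAML = "resume_param_space.yaml"


def save_state(round_histories, remaining_params):
    """Persist what a crashed run needs: the per-round stores plus the
    parameter permutations that have not been trained yet."""
    with open(STATE_PKL, "wb") as f:
        pickle.dump(round_histories, f)
    with open(STATE_YAML, "w") as f:
        yaml.safe_dump(remaining_params, f)


def load_state():
    """Read the saved state back so the experiment can resume where it
    left off."""
    with open(STATE_PKL, "rb") as f:
        round_histories = pickle.load(f)
    with open(STATE_YAML) as f:
        remaining_params = yaml.safe_load(f)
    return round_histories, remaining_params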

Kristin-Schwarzmuller commented 3 years ago

Hello, have you found a working solution or a workaround for this problem yet? I am currently facing the same issue.

mikkokotila commented 3 years ago

I will try to work on this next week.

Kristin-Schwarzmuller commented 3 years ago

Disabling eager execution solved the problem for me: tf.compat.v1.disable_eager_execution()
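
For reference, the call has to run at the very start of the program, before any model or the Scan() object is created; a minimal sketch assuming TF 2.x:

import tensorflow as tf

# Must be called before any graphs, models, or the Scan() object exist.
tf.compat.v1.disable_eager_execution()

# ...then define the model function and call talos.Scan() as usual.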

mikkokotila commented 3 years ago

@MolineraNegra wonderful.

@bjtho08 can you confirm if this works for you?

bjtho08 commented 3 years ago

@mikkokotila I'm confused; I thought Keras disabled eager execution by default, even on TF 2.x?