Ophys Example with sample data on GPU: OOM Error

ChrisWiesbrock commented 2 years ago

Hello there!

We try to run the example tiny ophys training on our computers. By now, it is running perfectly fine, when we just use the CPU, but we face the same error on different systems, if we run it on a GPU. So far, we tried a 2070 Ti, a 1070 and google colab. We use CUDA 11, cuDNN 8.0.4 and tensorflow 2.4.4. The example tiny ephys training is running fine on these GPUs. Is our hardware to weak or is there anything else we might can solve?

Have a nice day and all the best!

Chris

This is the error, we get:

ResourceExhaustedError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_13456/3701641197.py in ----> 1 training_class.run()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\deepinterpolation\trainor_collection.py in run(self) 243 use_multiprocessing=self.use_multiprocessing, 244 callbacks=self.callbacks_list, --> 245 initial_epoch=0, 246 ) 247 else:

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing) 1134 workers=workers, 1135 use_multiprocessing=use_multiprocessing, -> 1136 return_dict=True) 1137 vallogs = {'val' + name: val for name, val in val_logs.items()} 1138 epoch_logs.update(val_logs)

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\keras\engine\training.py in evaluate(self, x, y, batch_size, verbose, sample_weight, steps, callbacks, max_queue_size, workers, use_multiprocessing, return_dict) 1382 with trace.Trace('test', step_num=step, _r=1): 1383 callbacks.on_test_batch_begin(step) -> 1384 tmp_logs = self.test_function(iterator) 1385 if data_handler.should_sync: 1386 context.async_wait()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\def_function.py in call(self, *args, *kwds) 826 tracing_count = self.experimental_get_tracing_count() 827 with trace.Trace(self._name) as tm: --> 828 result = self._call(args, **kwds) 829 compiler = "xla" if self._experimental_compile else "nonXla" 830 new_tracing_count = self.experimental_get_tracing_count()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds) 893 # If we did not create any variables the trace we have is good enough. 894 return self._concrete_stateful_fn._call_flat( --> 895 filtered_flat_args, self._concrete_stateful_fn.captured_inputs) # pylint: disable=protected-access 896 897 def fn_with_cond(inner_args, inner_kwds, inner_filtered_flat_args):

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager) 1917 # No tape is watching; skip to running the function. 1918 return self._build_call_outputs(self._inference_function.call( -> 1919 ctx, args, cancellation_manager=cancellation_manager)) 1920 forward_backward = self._select_forward_and_backward_functions( 1921 args,

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager) 558 inputs=args, 559 attrs=attrs, --> 560 ctx=ctx) 561 else: 562 outputs = execute.execute_with_cancellation(

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name) 58 ctx.ensure_initialized() 59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, ---> 60 inputs, attrs, num_outputs) 61 except core._NotOkStatusException as e: 62 if name is not None:

ResourceExhaustedError: OOM when allocating tensor with shape[20,256,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node model/concatenate_2/concat-0-TransposeNHWCToNCHW-LayoutOptimizer}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [Op:__inference_test_function_1959]

Function call stack: test_function

jeromelecoq commented 2 years ago

Hi Chris, I am happy to help with this. What is your batch size for 2p? This can increase the need for GPU memory quite quickly.

ChrisWiesbrock commented 2 years ago

Hi Jerome,

sorry for the late response. I hope you had great holidays.

We reduced the batch size to 1 in order to see if this is the cause of the problem, but we still get the same error message.

We run it in Jupyter Notebook. Are there any known issues about that?

jtchang commented 2 years ago

I ran into this issue on Windows 10, python 3.7, TF2.4.4 on a GTX 1070.

It looks like the fit function is loading the entire validation set which is why it gives an oom error.

Lowering the window and test set size lets me run it on the GPU, but the results are not quite as good.

edit: I did a little more digging, and the GPU was loading the entire validation because of the cache_validation call, instead of passing the generator which respects the batch size. Setting the "caching_validation" field in training_params to False prevents this from happening.

ChrisWiesbrock commented 2 years ago

@jtchang

Thank you so much for your edit! This did the job for me as well.

Now the training is running smoothly in google colab without the OOM-Issue.

tomomano commented 3 months ago

@jtchang This saved my day! Thanks!

For thoese facing the same issue, you want to add this line in your training code.

training_param["caching_validation"] = False

AllenInstitute / deepinterpolation

Ophys Example with sample data on GPU: OOM Error #81