AllenInstitute / deepinterpolation


GPU memory error doing inference on GPU with some version combination #77

Open jeromelecoq opened 2 years ago

jeromelecoq commented 2 years ago

Running long inference jobs with TensorFlow 2.7 and Python 3.9 can cause GPU out-of-memory errors like the following:

2021-12-01 13:52:12.086386: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 28311552 totalling 27.00MiB
2021-12-01 13:52:12.086398: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 50568192 totalling 48.23MiB
2021-12-01 13:52:12.086405: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 24 Chunks of size 314572800 totalling 7.03GiB
2021-12-01 13:52:12.086412: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 335544320 totalling 320.00MiB
2021-12-01 13:52:12.086419: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 671088640 totalling 640.00MiB
2021-12-01 13:52:12.086426: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 8.11GiB
2021-12-01 13:52:12.086433: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 10919215104 memory_limit_: 10919215104 available bytes: 0 curr_region_allocation_bytes_: 21838430208
2021-12-01 13:52:12.086445: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats: Limit: 10919215104 InUse: 8706758144 MaxInUse: 9860656896 NumAllocs: 7155 MaxAllocSize: 3528458240 Reserved: 0 PeakReserved: 0 LargestFreeBlock: 0

2021-12-01 13:52:12.086459: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ********__***__
2021-12-01 13:52:12.086525: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at concat_op.cc:158 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[5,192,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/jeromel/Documents/Projects/Deep2P/repos/fine_tuning_jobs/2021-12-01-inference_main.py", line 45, in <module>
    inference_obj.run()
  File "/home/jeromel/Documents/Projects/Deep2P/repos/new_deepinterpolation/deepinterpolation/deepinterpolation/cli/inference.py", line 52, in run
    inferrence_class.run()
  File "/home/jeromel/Documents/Projects/Deep2P/repos/new_deepinterpolation/deepinterpolation/deepinterpolation/inferrence_collection.py", line 246, in run
    predictions_data = self.model.predict(local_data[0])
  File "/allen/programs/braintv/workgroups/nc-ophys/Jeromel/conda/tf2-7-deepinterp-py39/lib/python3.9/site-packages/keras-2.7.0-py3.9.egg/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/allen/programs/braintv/workgroups/nc-ophys/Jeromel/conda/tf2-7-deepinterp-py39/lib/python3.9/site-packages/tensorflow-2.7.0-py3.9-linux-x86_64.egg/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5,192,512,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node model_1/concatenate_4/concat (defined at /allen/programs/braintv/workgroups/nc-ophys/Jeromel/conda/tf2-7-deepinterp-py39/lib/python3.9/site-packages/keras-2.7.0-py3.9.egg/keras/backend.py:3224) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_predict_function_677]

jeromelecoq commented 2 years ago

The key part of this error is RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[5,192,512,512].

An array of this size is not requested at any point in the network architecture, which suggests that memory is leaking in the inference for-loop across batches.
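As a rough sanity check (my own back-of-the-envelope arithmetic, not something taken from the log), a single float32 tensor of that shape is already close to 1 GiB, so only a few leaked copies are enough to exhaust the ~10 GiB limit the allocator reports:

```python
# Rough size of one float32 tensor of shape [5, 192, 512, 512]
# (shape and dtype taken from the OOM message above).
import math

shape = (5, 192, 512, 512)
size_bytes = math.prod(shape) * 4          # float32 = 4 bytes per element
print(f"{size_bytes / 2**30:.2f} GiB")     # ~0.94 GiB per copy
```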

jeromelecoq commented 2 years ago

For large datasets, the expected use of the .predict function in TensorFlow is to feed in the entire dataset and let TensorFlow loop through it internally, creating its own batches.

However, since our datasets can be exceedingly large (60 GB or more), we cannot rely on everyone having 100 GB or more of RAM just to run inference, so I initially broke inference down into batches.

See here for the function call: https://github.com/AllenInstitute/deepinterpolation/blob/8a7834c82237ef2e27a6a76b47c4a8e9635da02e/deepinterpolation/inferrence_collection.py#L246
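For context, the loop at that line follows roughly this pattern (a simplified, self-contained sketch with a dummy model and random data; the real code uses the DeepInterpolation generator classes):

```python
# Minimal sketch of per-batch inference with model.predict() in a Python loop
# (dummy model and random data stand in for the DeepInterpolation objects).
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Conv2D(1, 3, padding="same", input_shape=(512, 512, 1))]
)

predictions = []
for _ in range(10):  # one iteration per batch of frames
    batch = np.random.rand(5, 512, 512, 1).astype("float32")
    # Calling predict() once per batch like this is the pattern that
    # appears to accumulate GPU memory across iterations on TF 2.7.
    predictions.append(model.predict(batch))
```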

jeromelecoq commented 2 years ago

It turns out TensorFlow now has a predict_on_batch function, which is meant for exactly this case: https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch

I found that simply dropping this function in at the exact line mentioned above removes the memory leak with TensorFlow 2.7.
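Concretely, the change amounts to a one-line swap at that line (sketched here against the call shown in the traceback; local_data[0] is the input array for the current batch):

```python
# Before: builds a new input pipeline on every call and appears to leak
# GPU memory across iterations of the inference loop (TF 2.7).
predictions_data = self.model.predict(local_data[0])

# After: runs a single forward pass directly on the in-memory batch.
predictions_data = self.model.predict_on_batch(local_data[0])
```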

jeromelecoq commented 2 years ago

I will make a PR with this fix and merge it to the main branch. A package release should follow once the fix proves compatible with older TensorFlow versions.

asumser commented 2 years ago

I ran into the same issue (Windows 10, Python 3.7, TensorFlow 2.4.4). My GPU is not great, though, so it could be the actual culprit. When I changed to predict_on_batch I got a different error; I don't know whether it is related...

2021-12-07 01:04:37.560685: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_training_full_args.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_training.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_generator.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_network.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_test_generator.json
2021-12-07 01:04:59.217073: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-07 01:04:59.218142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-12-07 01:04:59.241052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1 coreClock: 1.455GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-12-07 01:04:59.241223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:04:59.248164: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-12-07 01:04:59.248291: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-12-07 01:04:59.252300: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-12-07 01:04:59.254054: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-12-07 01:04:59.262101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-12-07 01:04:59.407947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-12-07 01:04:59.409217: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:04:59.409409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-07 01:04:59.409871: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-07 01:04:59.410653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1 coreClock: 1.455GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-12-07 01:04:59.410817: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:04:59.411169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-12-07 01:04:59.411543: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-12-07 01:04:59.411845: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-12-07 01:04:59.412182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-12-07 01:04:59.412520: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-12-07 01:04:59.412861: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-12-07 01:04:59.413173: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:04:59.413268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-07 01:04:59.862239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-12-07 01:04:59.862380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-12-07 01:04:59.862763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-12-07 01:04:59.863269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1326 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-12-07 01:04:59.865190: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:period argument is deprecated. Please use save_freq to specify the frequency in number of batches seen.
WARNING:tensorflow:period argument is deprecated. Please use save_freq to specify the frequency in number of batches seen.
INFO:Training:created objects for training
2021-12-07 01:05:00.392682: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/17
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
2021-12-07 01:05:03.109362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:05:13.244850: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:05:53.199917: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:06:38.756539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:07:25.575444: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:07:53.252457: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:08:13.651400: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:08:57.551788: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:09:20.196383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:09:47.127514: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:10:37.204455: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:11:13.151584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:11:48.301402: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:12:25.945740: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:13:02.681334: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:13:51.455925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:14:15.050249: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:14:16.152272: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2021-12-07 01:14:16.154185: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2021-12-07 01:14:16.155755: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
Traceback (most recent call last):
  File "cli_example_tiny_ophys_training.py", line 72, in <module>
    trainer.run()
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\deepinterpolation\cli\training.py", line 94, in run
    training_class.run()
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\deepinterpolation\trainor_collection.py", line 245, in run
    initial_epoch=0,
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1095, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\function.py", line 560, in call
    ctx=ctx)
  File "C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node model/conv2d/Relu (defined at C:\ProgramData\Anaconda3\envs\deepinterpolation\lib\site-packages\deepinterpolation\trainor_collection.py:245) ]] [Op:__inference_train_function_1736]

Function call stack: train_function

2021-12-07 01:14:23.500441: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

jeromelecoq commented 2 years ago

That error seems to be related to something else: it looks like CUDA has trouble initializing the GPU at all. Maybe open a separate issue.
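For what it's worth, on a 2 GB card CUDNN_STATUS_ALLOC_FAILED is often a sign that TensorFlow reserved nearly all GPU memory up front and cuDNN could not get a workspace. A generic TensorFlow workaround (not specific to DeepInterpolation, so just a suggestion to try) is to enable memory growth before any GPU work starts:

```python
# Generic TensorFlow workaround: allocate GPU memory on demand instead of
# reserving it all at start-up. Must run before the first GPU operation.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```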

asumser commented 2 years ago

Might be, sorry if that's not helpful. Anyhow, the RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[...] error is gone, so it gets a step further on TensorFlow 2.4.4.