Open jeromelecoq opened 2 years ago
The key to this error is RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[5,192,512,512]
A tensor of this shape is not requested at any point in the network architecture, which suggests that memory is leaking in the inference for loop across batches.
For large datasets, the expected use of the .predict function in tensorflow is to feed in the entire dataset and let the internal system loop through it, creating internal batches.
However, since our datasets can be exceedingly large (60GB or more), we can't rely on everyone being equipped with 100GB (or more) of RAM for the sake of doing inference. So I initially broke inference down into batches.
See here for the function call : https://github.com/AllenInstitute/deepinterpolation/blob/8a7834c82237ef2e27a6a76b47c4a8e9635da02e/deepinterpolation/inferrence_collection.py#L246
It turns out tensorflow now has a predict_on_batch function, which is expected to be used in those cases : https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict_on_batch
I found that simply swapping in predict_on_batch at the exact line mentioned above, under tensorflow 2.7, removes this memory leak error.
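For anyone reading along, here is a minimal sketch of the pattern described above. It is an illustration only, not the code in inferrence_collection.py: `predict_in_batches` and `DoublingModel` are hypothetical names, and the stand-in model exists only so the snippet runs without tensorflow installed. With a real `tf.keras.Model`, `data` could be a NumPy array or an h5py dataset.

```python
def predict_in_batches(model, data, batch_size):
    """Yield model outputs for `data`, one batch at a time.

    `data` only needs len() and slicing, so the full dataset never has
    to be loaded into RAM at once.
    """
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # predict_on_batch runs one forward pass on exactly this batch,
        # whereas predict() sets up its own internal batching on each call.
        yield model.predict_on_batch(batch)


class DoublingModel:
    """Hypothetical stand-in for a tf.keras.Model (assumption, for demo only)."""

    def predict_on_batch(self, batch):
        return [2 * x for x in batch]


outputs = list(predict_in_batches(DoublingModel(), list(range(5)), batch_size=2))
print(outputs)  # [[0, 2], [4, 6], [8]]
```

Yielding each batch result lets the caller write it to disk right away instead of accumulating every prediction in memory.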
I will make a PR with this fix and deploy to the main branch. A package release should be done if it proves to be compatible with older tensorflow versions.
I ran into the same issue (Windows 10, Python 3.7, tensorflow 2.4.4). My GPU is not great, though, so it could be the actual culprit. When I changed to predict_on_batch I got a different error. I don't know if that is related, though...
2021-12-07 01:04:37.560685: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_training_full_args.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_training.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_generator.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_network.json
INFO:Training:wrote D:\TestDeepInterpolation\testdir_cli\2021_12_07_01_04_test_generator.json
2021-12-07 01:04:59.217073: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-12-07 01:04:59.218142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-12-07 01:04:59.241052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1
coreClock: 1.455GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-12-07 01:04:59.241223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:04:59.248164: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-12-07 01:04:59.248291: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-12-07 01:04:59.252300: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-12-07 01:04:59.254054: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-12-07 01:04:59.262101: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-12-07 01:04:59.407947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-12-07 01:04:59.409217: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:04:59.409409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-07 01:04:59.409871: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-07 01:04:59.410653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 computeCapability: 6.1
coreClock: 1.455GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 104.43GiB/s
2021-12-07 01:04:59.410817: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:04:59.411169: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-12-07 01:04:59.411543: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-12-07 01:04:59.411845: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-12-07 01:04:59.412182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-12-07 01:04:59.412520: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-12-07 01:04:59.412861: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-12-07 01:04:59.413173: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:04:59.413268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-12-07 01:04:59.862239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-12-07 01:04:59.862380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-12-07 01:04:59.862763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-12-07 01:04:59.863269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1326 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-12-07 01:04:59.865190: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
INFO:Training:created objects for training
2021-12-07 01:05:00.392682: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/17
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
2021-12-07 01:05:03.109362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:05:13.244850: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:05:53.199917: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:06:38.756539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:07:25.575444: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:07:53.252457: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:08:13.651400: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:08:57.551788: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:09:20.196383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:09:47.127514: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:10:37.204455: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:11:13.151584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:11:48.301402: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:12:25.945740: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:13:02.681334: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:13:51.455925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-12-07 01:14:15.050249: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-12-07 01:14:16.152272: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2021-12-07 01:14:16.154185: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2021-12-07 01:14:16.155755: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
Traceback (most recent call last):
File "cli_example_tiny_ophys_training.py", line 72, in
Function call stack: train_function
2021-12-07 01:14:23.500441: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]
That error seems to be related to something else. It looks like CUDA has trouble initializing the GPU at all. Maybe make a separate issue.
Might be, sorry if that was not helpful. Anyhow, the RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[...] error is not there anymore, so it gets a step further on tensorflow 2.4.4.
Doing long inference with Tensorflow 2.7, Python 3.9 can cause gpu out of memory errors like so: